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Abstract — In this paper, we consider the problem of estimating 
finite rate of innovation (FRI) signals from noisy measurements, 
and specifically analyze the interaction between FRI techniques 
and the underlying sampling methods. We first obtain a funda- 
mental limit on the estimation accuracy attainable regardless 
of the sampling method. Next, we provide a bound on the 
performance achievable using any specific sampling approach. 
Essential differences between the noisy and noise-free cases arise 
from this analysis. In particular, we identify settings in which 
noise-free recovery techniques deteriorate substantially under 
slight noise levels, thus quantifying the numerical instability 
inherent in such methods. This instability, which is only present 
in some families of FRI signals, is shown to be related to a 
specific type of structure, which can be characterized by viewing 
the signal model as a union of subspaces. Finally, we develop a 
methodology for choosing the optimal sampling kernels based 
on a generalization of the Karhunen-Loeve transform. The 
results are illustrated for several types of time-delay estimation 
problems. 

Index Terms — Finite rate of innovation, Sampling, Cramer- 
Rao bound, Union of subspaces, Time-delay estimation 



I. Introduction 

The field of digital signal processing hinges on the availabil- 
ity of techniques for sampling analog signals, thus converting 
them to discrete measurements. The sampling mechanism aims 
to preserve the information present in the analog domain, 
ideally permitting flawless recovery of the original signal. For 
example, one may wish to recover a continuous-time signal 
x(t) from a discrete set of samples. The archetypical manifes- 
tation of this concept is the Shannon sampling theorem, which 
states that a S-bandlimited function can be reconstructed from 
samples taken at the Nyquist rate 2B [1]. 

Recently, considerable attention has been devoted to the 
extension of sampling theory to functions having a finite rate 
of innovation (FRI). These are signals determined by a finite 
number p of parameters per time unit [2]. Such a definition 
encompasses a rich variety of signals, including splines, shift- 
invariant signals, multiband signals, and pulse streams. In 
many FRI settings, several existing algorithms are guaranteed 
to recover the signal x(t) from samples taken at rate p [2]-[7]. 
In other words, signals which correspond to the FRI model can 
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be reconstructed from samples taken at the rate of innovation, 
which is potentially much lower than their Nyquist rate. 

The set of signals described by an FRI model can often 
be viewed as a union of subspaces [5]-[8]. For example, 
consider a stream of pulses parameterized by pulse locations 
and amplitudes. The set of all pulses having a given location 
is a subspace of the space of continuous-time functions. Thus, 
the set of all pulses having arbitrary locations is a union of 
such subspaces. As we will see, this point of view yields a 
flexible and productive framework for understanding the types 
of constraints implied by the model. 

Real-world signals are often contaminated by continuous- 
time noise and thus do not conform precisely to the FRI 
model. Furthermore, like any mathematical model, the FRI 
framework is an approximation which does not precisely hold 
in practical scenarios, an effect known as mismodeling error 
[8]. It is therefore of interest to quantify the effect of noise 
and mismodeling errors on FRI techniques [3], [6], [7], [9]. In 
the noisy case, it is no longer possible to perfectly recover the 
original signal from its samples. Nevertheless, one might hope 
for an appropriate finite-rate technique which achieves the 
best possible estimation accuracy, in the sense that increasing 
the sampling rate confers no further performance benefits. 
For example, to recover a B-bandlimited signal contaminated 
by continuous-time white noise, one can use an ideal low- 
pass filter with cutoff B prior to sampling at a rate of 2B. 
This strategy removes all noise components with frequencies 
larger than B, while leaving all signal components intact. 
Consequently, any alternative method which does not zero 
out frequencies above B can be improved upon, whereas 
methods which zero out some of the signal frequencies can 
suffer from an arbitrarily large reconstruction error. Thus, 
sampling at a rate of 2B is indeed optimal in the case of a 
B-bandlimited signal, if the signal is corrupted by continuous- 
time noise prior to sampling. Sampling at a rate higher than 
2B can be beneficial only when the sampling process itself 
introduces additional noise into the system, e.g., as a result of 
quantization. 

By contrast, empirical observations indicate that, for some 
noisy FRI signals, substantial performance improvements are 
achievable when the sampling rate is increased beyond the rate 
of innovation [4], [7]. Thus, in some cases, there appears to 
be a fundamental difference between the noiseless and noise- 
corrupted settings, in terms of the required sampling rate. Our 
first goal in this paper will be to provide an analytical justifica- 
tion and quantification of these empirical findings. As we will 
see, the fact that oversampling improves performance is not 
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merely indicative of flaws in existing algorithms; rather, it is 
a consequence of the inherent difficulty of reconstructing FR1 
signals under noise. Indeed, we will demonstrate that for some 
FRI signals, unless considerable oversampling is employed, 
performance will necessarily deteriorate by several orders of 
magnitude relative to the optimal achievable reconstruction 
capability. Such effects occur even when the noise level is 
exceedingly low. Our analysis will also enable us to identify 
and characterize the types of signals for which oversampling 
is necessary. 

To demonstrate these results, we first derive the Cramer-Rao 
bound (CRB) for estimating a finite-duration segment of an 
FRI signal x(t) directly from continuous-time measurements 
y(t) = x(t) + w(t), where w(t) is a Gaussian white noise 
process. This yields a lower bound on the accuracy whereby 
x(t) can be recovered by any technique, regardless of its 
sampling rate. This setting is to be distinguished from previous 
bounds in the FRI literature [9], [10] in three respects. First 
and most importantly, the measurements are a continuous- 
time process y(t) and the bound therefore applies regardless 
of the sampling method. Second, in our model, the noise 
is added prior to sampling. Thus, as will be shown below, 
even sampling at an arbitrarily high rate will not completely 
compensate for the noise. Third, we bound the mean-squared 
error (MSE) in estimating x(t) and not the parameters defining 
it, since we seek to determine the accuracy with which x(t) 
itself can be recovered. Such a bound does not depend on 
the specific parametrization of the signal, and consequently, 
possesses a simpler analytical expression. 

In practice, rather than processing the continuous-time 
signal y(t), it is typically desired to estimate x(t) from 
a discrete set of samples {c„} of y(t). In this scenario, 
in addition to the continuous-time noise w(t), digital noise 
may arise from the sampling process itself, for example due 
to quantization. To quantify the extent to which sampling 
degrades the ability to recover the signal, we next derive the 
CRB for estimating x(t) from the measurements {c n }. This 
analysis depends on the relative power of the two noise factors. 
When only digital noise is present, oversampling can be used 
to completely overcome its effect. On the other hand, when 
there exists only continuous-time noise, the bound converges 
to the continuous-time CRB as the sampling rate increases. In 
some cases, these bounds coincide at a finite sampling rate, 
which implies that the sampling scheme has captured all of 
the information present in the continuous-time signal, and any 
further increase in the sampling rate is useless. Conversely, 
when the continuous-time and sampled CRBs differ, the gap 
between these bounds is indicative of the degree to which 
information is lost in the sampling process. Our technique can 
then be used to plot the best possible performance as a function 
of the sampling rate, and thus provide the practitioner with a 
tool for evaluating the benefits of oversampling. 

When a certain sampling technique achieves the perfor- 
mance of continuous-time measurements, it can be identified 
using the method described above. However, in some cases 
no such technique exists, or the sampling rate it requires may 
be prohibitive. In these cases, it is desirable to determine 
the optimal sampling scheme having an allowed rate. Since 



different signals are likely to perform successfully with differ- 
ent sampling kernels, a Bayesian or average-case analysis is 
well-suited for this problem. Specifically, we assume that the 
signal x(t) has a known prior distribution over the class of 
signals, and determine the linear sampling and reconstruction 
technique which minimizes the MSE for recovering x(t) from 
its measurements. While nonlinear reconstruction techniques 
are commonly used and typically outperform the best linear 
estimator, this approach provides a simple means for identi- 
fying an appropriate sampling method. The resulting method 
can then be used in conjunction with standard nonlinear FRI 
algorithms. 

We demonstrate our results via the problem of estimating a 
finite-duration sequence of pulses having unknown positions 
and amplitudes [2], [4], [5], [7]. In this case, a simple sufficient 
condition is obtained for the existence of a sampling scheme 
whose performance bound coincides with the continuous-time 
CRB. This scheme is based on sampling the Fourier coeffi- 
cients of the pulse shape, and is reminiscent of recent time- 
delay estimation algorithms [7]. However, while the sampling 
scheme is theoretically sufficient for optimal recovery of x(t), 
we show that in some cases there is room for substantial 
improvement in the reconstruction stage of these algorithms. 
Finally, we demonstrate that the Fourier domain is also optimal 
(in the sense of minimizing the reconstruction MSE) when 
the sampling budget is limited. Specifically, given an allowed 
number of samples N, the reconstruction MSE is minimized 
by sampling the N highest-variance Fourier coefficients of the 
signal x{t). 

The rest of this paper is organized as follows. The problem 
setting is defined in Section II, and some examples of signals 
conforming to this model are presented in Section III. We then 
briefly summarize our main results in Section IV. In Section V, 
we provide a technical generalization of the CRB to general 
spaces. This result is used to obtain bounds on the achievable 
reconstruction error from continuous-time measurements (Sec- 
tion VI) and using a sampling mechanism (Section VII). Next, 
in Section VIII a Bayesian viewpoint is introduced and utilized 
to determine the optimal sampling kernels having a given rate 
budget. The results are demonstrated for the specific signal 
model of time-delay estimation in Section IX. 

II. Definitions 

A. Notation 

The following notation is used throughout the paper. A 
boldface lowercase letter v denotes a vector, while a boldface 
uppercase letter M denotes a matrix. Ijy is the N x N 
identity matrix. For a vector v, the notation ||v|| indicates 
the Euclidean norm. Given a complex number z £ C, the 
symbols z* and 5ft{z} denote the complex conjugate and the 
real part of z, respectively. For an operator P, the range space 
and null space are TZ(P) and Af(P), respectively, while the 
trace and adjoint are denoted, respectively, by Tr(P) and P* . 
The Kronecker delta, denoted by 8 m . n , equals 1 when m = n 
and otherwise. The expectation of a random variable v is 
written as E{u}. 
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The Hilbert space of square-integrable complex-valued 
functions over [0,Tb] is denoted L2[0,To] or simply L 2 - The 
corresponding inner product is 

</,<?> = P f(t)g*(t)dt (l) 

Jo 

and the induced norm is ||/||f, = (/,/)■ For an ordered set 
of K functions gi,. ■ . ,gn in L 2 , we define the associated set 
transformation G : C A — > L 2 as 



A" 



(Gv)(f) = 5>fcff*(t). 



k=l 



By the definition of the adjoint, it follows that 

G*/ = ((/,gi),-..,(/,.9x)) T . 



(2) 



(3) 



B. Setting 

In this work, we are interested in the problem of estimating 
FRI signals from noisy measurements. To define FRI signals 
formally, let the Xo-l° ca l number of degrees of freedom Nt if) 
of a signal x(t) at time t be the number of parameters defining 
the segment {x(t) : t G [t - T /2,t + T /2]}. The T -local 
rate of innovation of xifj is then defined as [2] 

N To f) 



PT 



max ■ 



T 



(4) 



We then say that x(t) is an FRI signal if px is finite for all 
sufficiently large values of T . In Section III, we will give 
several examples of FRI signals and compute their rates of 
innovation. 

For concreteness, let us focus on the problem of estimating 
the finite-duration segment {x(t) : t e [0,To]}, for some 
constant To, and let K = Nt q (To/2) denote the number of 
parameters defining this segment. We then have 



xeX = {h e e L 2 [0,T o ] : e 6} 



(5) 



where fig is a set of functions parameterized by the vector 6, 
and is an open subset of M. K . 

We wish to examine the random process 



y(t) = x(t) +w(t), te[0,T o ] 



(6) 



where wf) is continuous-time white Gaussian noise. Recall 
that formally, it is not possible to define Gaussian white noise 
over a continuous-time probability space [11]. Instead, we 
interpret (6) as a simplified notation for the equivalent set of 
measurements 

z(t)= [ x(r)dT +a c b(t), te[0,T ] (7) 
Jo 

where bf) is a standard Wiener process (also called Brownian 
motion) [12]. It follows that w(t) can be considered as a 
random process such that, for any /, g <G L 2 , the inner products 
a = (/, w) and b = (g, w) are zero-mean jointly Gaussian 
random variables satisfying E{a&*} = a^(f,g) [11]. The 
subscript c in a c is meant as a reminder of the fact that wif) is 
continuous-time noise. By contrast, when examining samples 
of the random process yif), we will also consider digital noise 
which is added during the sampling process. 



In this paper, we consider estimators which are functions 
either of the entire continuous-time process (6) or of some 
subset of the information present in (6), such as a discrete set 
of samples of yif). To treat these two cases in a unified way, 
let (f2,jF) be a measurable space and let {Pg : 9 e 0} be 
a family of probability measures over (f2, J^"). Let (3^, ^) be 
a measurable space, and let the random variable y : il —> y 
denote the measurements. This random variable can represent 
either yf) itself or samples of this quantity. 

An estimator can be defined in this general setting as a 
measurable function x : y — > The MSE of an estimator x 
at x is defined as 



MSE(£,x) =E{||:r-x||| 9 } = E<^ / \xf) - xf)\ 2 dt 



T 



(8) 



An estimator x is said to be unbiased if 



E{xf)} = x(t) for all x € X and almost all t e [0,T ]. (9) 

In the next section, we demonstrate the applicability of our 
model by reviewing several scenarios which can be formulated 
using the FRI framework. Some of these settings will also be 
used in the sequel to exemplify our theoretical results. 

III. Types of FRI Signals 

Numerous FRI signal structures have been proposed and 
analyzed in the sampling literature. Whereas most of these can 
be treated within our framework, some FRI structures do not 
conform exactly to our problem setting. Thus, before delving 
into the derivation of the CRB, we first provide examples for 
scenarios that can be analyzed via our model and discuss some 
of its limitations. 



A. Shift-Invariant Spaces 

Consider the class of signals that can be expressed as 



x(t) = a[m]g(t — mT) 



(10) 



with some arbitrary square-integrable sequence {a[m]} TO£ z, 
where gif) is a given pulse in L^i^) and T > is a given 
scalar. This set of signals is a linear subspace of L2OR), which 
is often termed a shift-invariant (SI) space [13], [14]. The 
class of functions that can be represented in the form (10) 
is quite large. For example, choosing git) = sinc(t/T) leads 
to the subspace of 7r/T-bandlimited signals. Other important 
examples include the space of spline functions (obtained by 
letting gf) be a B-spline function) and communication signals 
such as pulse-amplitude modulation (PAM) and quadrature 
amplitude modulation (QAM). Reconstruction in SI spaces 
from noiseless samples has been addressed in [15], [16] and 
extended to the noisy setting in [17]— [19]. 

Intuitively, every signal lying in a SI space with spacing T 
has one degree of freedom per T seconds (corresponding to 
one coefficient from the sequence {a[m]}). It is thus tempting 
to regard the rate of innovation of such signals as 1/T. 
However, this is only true in an asymptotic sense and for 
compactly supported pulses gif). For any finite window size 
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To, the To-local rate of innovation px is generally larger. 
Specifically, suppose that the support of git) is contained in 
[ta)£&] an d consider intervals of the form [t,t + MT], where 
M is an integer. Then, due to the overlaps of the pulses, for 
any such interval we can only assure that there are no more 
than M + \(tf, — t a )/T~\ coefficients affecting the values of 
x(t). Thus, the MT-local rate of innovation of signals of the 
form (10) is given by 



PMT 




(ID 



In particular, signals of the form (10) having a generator g(t) 
which is not compactly supported have an infinite T) -local 
rate of innovation, for any finite T>. This is the case, for 
example, with bandlimited signals, which are therefore not 
FRI functions under our definition. As will be discussed in the 
sequel, this is not a flaw of the definition we use for the rate 
of innovation. Rather, it reflects the fact that it is impossible 
to recover any finite-duration segment [Ti, T2] of such signals 
from a finite number of measurements. 

B. N online arly- Distorted Shift-Invariant Spaces 

In certain communication scenarios, nonlinearities are in- 
troduced in order to avoid amplitude clipping, an operation 
known as companding [20]. When the original signal lies in 
a SI space, the resulting transmission takes the form 



x(t) 



^2 a H.9(* - mT) 



(12) 



where r(-) is a nonlinear, invertible function. Clearly, the MT- 
local rate of innovation pmt of this type of signals is the same 
as that of the underlying SI function, and is thus given by (11). 
The recovery of nonlinearly distorted SI signals from noiseless 
samples was treated in [20]-[23]. We are not aware of research 
works treating the noisy case. 

C. Union of Subspaces 

Much of the FRI literature treats signal classes which are 
unions of subspaces [5], [6], [8], [24]. We now give examples 
of a few of these models. 

1) Finite Union of Subspaces: There are various situations 
in which a continuous-time signal is known to belong to one 
of a finite set of spaces. One such signal model is described 
by 



K 



x(t) = a k [m]g k (t - mT), 



(13) 



where {gk(fj}k=i are a set °f generators. In this model 
it is assumed that only L < K out of the K sequences 
{ai[m]} meZ , . . . , {a K [m]} meZ are not identically zero [25]. 
Therefore, the signal x(t) is known to reside in one of (^) 
spaces, each of which is spanned by an T-element subset of 
the set of generators {gk(t)}k—v This class of functions can 
be used to describe multiband signals [24], [26]. However, 
the discrete nature of these models precludes analysis using 
the differential tools employed in the remainder of this paper. 



Therefore, in this work we will focus on infinite unions of 
subspaces. 

2) Single-Burst Channel Sounding: In certain medium 
identification and channel sounding scenarios, the echoes of 
a transmitted pulse g(t) are analyzed to identify the positions 
and reflectance coefficients of scatterers in the medium [7], 
[27]. In these cases, the received signal has the form 



x(t) =^2a e g{t - ti), 



(14) 



(=1 



where L is the number of scatterers and the amplitudes 
{ai}f =1 and time-delays \ti\\ =x correspond to the reflectance 
and location of the scatterers. Such signals can be thought of 
as belonging to a union of subspaces, where the parameters 
{te}f =1 determine an T-dimensional subspace, and the coef- 
ficients {a,i}f =1 describe the position within the subspace. In 
contrast with the previous example, however, in this setting 
we have a union of an infinite number of subspaces, since 
there are infinitely many possible values for the parameters 
t\, . . . , tL- 

In this case, for any window of size To > max^{^} — 
mirif{<f}, the T)-local rate of innovation is given by 



2L 

PT - 

Jo 



(15) 



3) Periodic Channel Sounding: Occasionally, channel 
sounding techniques consist of repeatedly probing the medium 
[28]. Assuming the medium does not change throughout the 
experiment, the result is a periodic signal 



x(t) = X X aeg ^ ~ ^ _ mT )- 



(16) 



mGZ 1=1 



As before, the set X of feasible signals is an infinite union of 
finite-dimensional subspaces in which {te}f =1 determine the 
subspace and {ae}f =1 define the position within the subspace. 
The T)-local rate of innovation in this case coincides with (15). 

4) Semi-Periodic Channel Sounding: There are situations 
in which a channel consists of L paths whose amplitudes 
change rapidly, but the time delays can be assumed constant 
throughout the duration of the experiment [5], [28], [29]. In 
these cases, the output of a channel sounding experiment will 
have the form 

L 

x{ f) = X) £ a *H<?(* -ti- mT), (17) 

where ag [m] is the amplitude of the £th path at the mth probing 
experiment. This is, once again, a union of subspaces, but here 
each subspace is infinite-dimensional, as it is determined by 
the infinite set of parameters {ci£[to]}. In this case, the MT- 
local rate of innovation can be shown to be 



Pmt 




(18) 
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5) Multiband Signals: Multiuser communication channels 
are often characterized by a small number of utilized sub- 
bands interspersed by large unused frequency bands [26]. The 
resulting signal can be described as 

L 

= E E a <Ms(* - my**', d9) 

where {a^[n]} ne z is the data transmitted by the Ah user, 
and bjg is the corresponding carrier frequency. In some cases 
the transmission frequencies are unknown [24], [26], resulting 
in an infinite union of infinite-dimensional subspaces. This 
setting is analogous in many respects to the semi-periodic 
channel sounding case; in particular, the MT-local rate of 
innovation can be shown to be the same as that given by (18). 

IV. Summary of Main Results 

Before delving into the mathematical details, we provide in 
this section a high-level overview of our main contributions 
and summarize the resulting conclusions. 

The overarching objective of this paper is to design and 
analyze sampling schemes for reconstructing FRI signals from 
noisy measurements. This goal is accomplished in three stages. 
First, we identify the best achievable MSE for estimating 
an FRI signal x(t) from its continuous-time measurements 
y(t) = x(t) + w(t), providing a fundamental lower bound 
which is independent of the sampling method. We then 
compare this continuous-time bound with the lowest possible 
MSE for a given sampling scheme, thus measuring the loss 
entailed in any particular technique. Finally, we provide a 
mechanism for choosing the optimal sampling kernels (in a 
specific Bayesian sense) utilizing a pre-specified sampling rate 
budget. Our results can be applied to specific families of FRI 
signals, but they also yield some general conclusions as to the 
relative difficulty of various classes of estimation problems. 
These general observations are summarized below. 

A. Continuous-Time Bound 

Our first goal in this paper is to derive the continuous- 
time CRB, which defines a fundamental limit on the accuracy 
with which an FRI signal can be estimated, regardless of the 
sampling technique. This bound turns out to have a particularly 
simple closed form expression which depends on the rate 
of innovation, but not on the class X of FRI signals being 
estimated. Specifically, under suitable regularity conditions, 
the MSE of any unbiased estimator x satisfies 

^MSE(x,x)> PTo al (20) 

Thus, the rate of innovation can be given a new interpretation 
as the ratio between the best achievable MSE and the noise 
variance a 2 c . This is to be contrasted with the characterization 
of the rate of innovation in the noise-free case as the lowest 
sampling rate allowing for perfect recovery of the signal; 
indeed, when noise is present, perfect recovery is no longer 
possible. 



B. Bound for Sampled Measurements 

We next consider lower bounds for estimating x(t) from 
samples of the signal y(t). In this setting, the samples in- 
herit the noise w(t) embedded in the signal y(t), and may 
suffer from additional discrete-time noise, for example, due 
to quantization. We derive the CRB for estimating x(t) from 
sampled measurements in the presence of both types of noise. 
However, the combination of the two noise models complicates 
the mathematical analysis. Consequently, since the sampling 
noise model has been previously analyzed [9], [10], we focus 
in this paper on the assumption that the discrete-time noise is 
negligible. 

In this setting, the sampled CRB can be designed so as 
to converge to the continuous-time bound as the sampling 
rate increases. Moreover, if the family X of FRI signals is 
contained in a finite-dimensional subspace Ai of L2, then a 
sampling scheme achieving the continuous-time CRB can be 
constructed. Such a sampling scheme is obtained by choosing 
kernels which span the subspace Ai, and yields samples 
which fully capture the information present in the signal y(t). 
Contrariwise, if X is not contained in a finite-dimensional 
subspace, then no finite-rate sampling method achieves the 
continuous-time CRB. In this case, any increase in the sam- 
pling rate can improve performance, and the continuous-time 
bound is obtained only asymptotically. 

It is interesting to examine this distinction from a union 
of subspaces viewpoint. Suppose that, as in the examples of 
Section III-C, the family X can be described as a union of an 
infinite number of subspaces {U a } indexed by the continuous 
parameter a, so that 

X = \JU a . (21) 

a 

In this case, a finite sampling rate captures all of the informa- 
tion present in the signal if and only if 

dim ^E Wq ^ < 00 (22) 

where dim(A^) is the dimension of the subspace Ai. By 
contrast, in the noise-free case, it has been previously shown 
[30] that the number of samples required to recover x(t) is 
given by 

max dim(U ai + U ao ), (23) 

i.e., the largest dimension among sums of two subspaces 
belonging to the union. In general, the dimension of (22) will 
be much higher than (23), illustrating the qualitative difference 
between the noisy and noise-free settings. For example, if 
the subspaces U a are finite-dimensional, then (23) is also 
necessarily finite, whereas (22) need not be. 

Nevertheless, one may hope that the structure embodied in 
X will allow nearly optimal recovery using a sampling rate 
close to the rate of innovation. This is certainly the case 
in many noise-free FRI settings. For example, there exist 
techniques which recover the pulse stream (14) from samples 
taken at the rate of innovation, despite the fact that in this case 
X is typically not contained in a finite-dimensional subspace. 



6 



However, this situation often changes when noise is added, in 
which case standard techniques improve considerably under 
oversampling. This empirical observation can be quantified 
using the CRB: as we show, the CRB for samples taken at 
the rate of innovation is substantially higher in this case than 
the optimal, continuous-time bound. This demonstrates that 
the sensitivity to noise is a fundamental aspect of estimating 
signals of the form (14), rather than a limitation of existing 
algorithms. On the other hand, other FRI models, such as the 
semi-periodic pulse stream (17), exhibit considerable noise 
resilience, and indeed in these cases the CRB converges to 
the continuous-time value much more quickly. 

As we discuss in Section IX-E, the different levels of 
robustness to noise can be explained when the signal models 
are examined in a union of subspaces context. In this case, the 
parameters 6 defining x(t) can be partitioned into parameters 
defining the subspace U a and parameters pinpointing the 
position within the subspace. Our analysis hints that estimation 
of the position within a subspace is often easier than estimation 
of the subspace itself. Thus, when most parameters are used 
to select an intra-subspace position, estimation at the rate of 
innovation is successful, as occurs in the semi-periodic case 
(17). By contrast, when a large portion of the parameters 
define the subspace in use, a sampling rate higher than 
the rate of innovation is necessary; this is the case in the 
non-periodic pulse stream (14), wherein 6 is evenly divided 
among subspace-selecting parameters {ti} and intra-subspace 
parameters {ae}. Thus we see that the CRB, together with 
the union of subspaces viewpoint, provide valuable insights 
into the relative degrees of success of various FRI estimation 
techniques. 

C. Choosing the Sampling Kernels 

In some cases, one may choose the sampling system 
according to design specifications, leading to the question: 
What sampling kernels should be chosen given an allotted 
number of samples? We tackle this problem by adopting a 
Bayesian framework, wherein the signal x(t) is a random 
process distributed according to a known prior distribution. 
We further assume that both the sampling and reconstruction 
techniques are linear. While nonlinear reconstruction methods 
are often used for estimating FRI signals, this assumption is 
required for analytical tractability, and is used only for the 
purpose of identifying sampling kernels. Once these kernels 
are chosen, they can be used in conjunction with nonlinear 
reconstruction algorithms. 

Under these assumptions, we identify the sampling kernels 
yielding the minimal MSE. An additional advantage of our 
assumption of linearity is that in this case, the optimal kernels 
depend only on the autocorrelation 

Rx(t,r) = E{x(t)x*(r)} (24) 

of the signal x(t), rather than on higher-order statistics. Indeed, 
given a budget of N samples, the optimal sampling kernels are 
given by the N eigenfunctions of Rx corresponding to the 
N largest eigenvalues. This is reminiscent of the Karhunen- 
Loeve transform (KLT), which can be used to identify the 



optimal sampling kernels in the noiseless setting. However, 
in our case, shrinkage is applied to the measurements prior 
to reconstruction, as is typically the case with Bayesian 
estimation of signals in additive noise. 

A setting of particular interest occurs when the autocorre- 
lation Rx is cyclic, in the sense that 

Rx(t,r) = Rx{{t - r)modT) (25) 

for some T. This scenario occurs, for example, in the periodic 
pulse stream (16) and the semi -periodic pulse stream (17), 
assuming a reasonable prior distribution on the parameters 9. 
It is not difficult to show that the eigenfunctions of Rx are 
given, in this case, by the complex exponentials 

^ n (t) = -^=e^ nt , n g Z. (26) 

Furthermore, in the case of the periodic and semi-periodic 
pulse streams, the magnitudes of the eigenvalues of Rx 
are directly proportional to the magnitudes of the respective 
Fourier coefficients of the pulse shape g(t). It follows that 
the optimal sampling kernels are the exponentials (26) corre- 
sponding to the largest Fourier coefficients of g(t). This result 
is encouraging in light of recently proposed FRI reconstruction 
techniques which utilize exponential sampling kernels [6], 
and demonstrates the suitability of the Bayesian approach for 
designing practical estimation kernels. 

V. Mathematical Prerequisites: CRB for General 
Parameter Spaces 

In statistics and signal processing textbooks, the CRB is typ- 
ically derived for parameters belonging to a finite-dimensional 
Euclidean space [31]— [33]. However, this result is insufficient 
when it is required to estimate a parameter x belonging to 
other Hilbert spaces, such as the L2 space defined above. 
When no knowledge about the structure of the parameter x 
is available, a bound for estimating x(t) from measurements 
contaminated by colored noise was derived in [34]. However, 
this bound does not hold when the noise w(i) is white. Indeed, 
in the white noise case, it can be shown that no finite-MSE 
unbiased estimators exist, unless further information about 
x(t) is available. For example, the naive estimator x(t) = y(t) 
has an error x(t) — x{t) equal to w(t), whose variance is 
infinite. 

In our setting, we are given the additional information that 
x belongs to the constraint set X of (5). To the best of our 
knowledge, the CRB has not been previously defined for any 
type of constraint set X 6 L2, a task which will be accom- 
plished in the present section. As we show below, a finite- 
valued CRB can be constructed by requiring unbiasedness 
only within the constraint set X, as per (9). As we will 
see, the CRB increases linearly with the dimension of the 
manifold X. Thus, in particular, the CRB is infinite when 
X = L2- However, for FRI signals, the dimension of X is 
finite by definition, implying that a finite-valued CRB can be 
constructed. Although the development of this bound invokes 
some deep concepts from measure theory, it is a direct analog 
of the CRB for finite-dimensional parameters [32, Th. 2.5.15]. 
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To derive the bound in the broadest setting possible, in this 
section we temporarily generalize the scenario of Section II, 
and consider estimation of a parameter x belonging to an 
arbitrary measurable and separable Hilbert space H. The MSE 
of an estimator x in this setting is defined as MSE(x, x) = 
E{||x — The concept of bias can similarly be extended 

if one defines the expectation K{v} of a random variable 
v : — > W as an element k E H such that (k, ip) = E{(u, (/?}} 
for any ip € H. If no such element exist, the expectation is 
said to be undefined. 

The derivation of the CRB requires the existence of a "prob- 
ability density" pg{y) (more precisely, a Radon-Nikodym 
derivative) which is differentiable with respect to 9, and such 
that its differentiation with respect to 9 can be interchanged 
with integration with respect to y. The CRB also requires 
the mapping hg between 9 and x to be non-redundant and 
differentiable. The formal statement of these regularity con- 
ditions is specified below. For the measurement setting (6), 
with reasonable mappings hg, these conditions are guaranteed 
to hold, as we will demonstrate in the sequel; in this section, 
however, we list these conditions in full so that a more general 
statement of the CRB will be possible. 

PI) There exists a value 9q G 6 such that the measure Pg 
dominates {Pg : 9 <E 0}. In other words, there exists a 
Radon-Nikodym derivative pg (y) = dPg / dPg a such that, 
for any event /let, 



Po(A)= / P g(y)Pg (dy)- 



(27) 



P2) For all y such that pg(y) > 0, the functions pg(y) 
and \ogpg(y) are continuously differentiable with respect 
to 9. We denote by dpg(y)/d9 and 3\ogpg(y) / 39 the 
column vectors of the gradients of these two functions. 

P3) The support {y G y : pg(y) > 0} of pg(y) is independent 
of 9. 

P4) There exists a measurable function gi^xS^l such 
that for all sufficiently small A > 0, for alii = 1 , . . . , K, 
for all y, and for all 9, 

^\pg+A e M-Pe(y)\<q(y,0) (28) 



and such that for all 9, 

q 2 {y,9)P 0o {dy) < oo. 



(29) 



In (28), represents the ith column of the KxK identity 
matrix. 

P5) For each 9, the K x K Fisher information matrix (FIM) 

'd\ogp e {y)\ ( d\ogp e {y)\*~ 



Je =E 



39 



06 



(30) 



is finite and invertible. 
P6) hg is Frechet differentiable with respect to 9, in the sense 
that for each 9, there exists a continuous linear operator 
dhg jd9 :R K -> L 2 such that, for any sufficiently small 

8 G M. K , 

hg+s - hg dhg 



89 



■« + o(||*||) as ||<5||^0. (31) 



P7) The null space of the mapping dhg jd9 contains only the 
zero vector. This assumption is required to ensure that the 
mapping from 9 to x is non-redundant, in the sense that 
there does not exist a parametrization of X in which the 
number of degrees of freedom is smaller than K. 
We are now ready to state the CRB for the estimation of a 
parameter x G L,2[0,T] parameterized by a finite-dimensional 
vector 9. The proof of this theorem is given in Appendix A. 

Theorem 1. Let 9 G be a deterministic parameter, where 
is an open set in R . Let H be a measurable, separable 
Hilbert space and let hg be a mapping from to Ji. Let {Pg : 
9 G 0} be a family of probability measures over a measurable 
space (fi, and let y : O — > y be a random variable, where 
y is a measurable Hilbert space. Assume regularity conditions 
P1-P6. Let x : y — > % be an unbiased estimator of x from 
the measurements y such that 



i{\\Hy)\\n} < 



DC'. 



Then, the MSE of x satisfies 

MSE(i, x) > Tr 
where Jg is the FIM (30). 



( dhg 

I 89 



dhg 
39 



J. 1 



(32) 



(33) 



Theorem 1 enables us to obtain a lower bound on the 
estimation error of x based on the FIM for estimating 9. The 
latter can often be computed relatively easily since 9 is a 
finite-dimensional vector. Even more conveniently, the trace 
on the right-hand side of (33) is taken over a K x K matrix, 
despite the involvement of continuous-time operators. Thus, 
the computation of (33) is often possible either analytically 
or numerically, a fact which will be used extensively in the 
sequel. 

VI. CRB for Continuous-Time Measurements 

We now apply Theorem 1 to the problem of estimating a 
deterministic signal x from continuous-time measurements y 
given by (6). 

Theorem 2. Let x be a deterministic function defined by (5), 
where 9 G is an unknown deterministic parameter and 
is an open subset of M. K . Suppose that Assumptions P6-P7 
are satisfied. Then, the MSE of any unbiased, finite-variance 
estimator x is bounded by 



MSE(.i, x) > Ka\ 



(34) 



The bound of Theorem 2 can be translated to units of 
the rate of innovation pr if we assume that the segment 
[0, To] under analysis achieves the maximum (4), i.e., this is a 
segment containing the maximum allowed number of degrees 
of freedom. In this case pt q = K/Tq, and any unbiased 
estimator x(t) satisfies 



^{f^\x(t)-x(t)\*dt} 



> Pt 



(35) 



In the noisy setting, px loses its meaning as a lower bound 
on the sampling rate required for perfect recovery, since the 
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latter is no longer possible at any sampling rate. On the other 
hand, it follows from (35) that the rate of innovation gains 
an alternative meaning; namely, pT is a lower bound on the 
ratio between the average MSE achievable by any unbiased 
estimator and the noise variance a 2 , regardless of the sampling 
method. 

Before formally proving Theorem 2, note that (34) has an 
intuitive geometric interpretation. Specifically, the constraint 
set X is a A'-dimensional differential manifold in L,2[0,T]. In 
other words, for any point x G X, there exists a A'-dimensional 
subspace U tangent to X at x. We refer to U as the feasible 
direction subspace [35]: any perturbation of x which remains 
within the constraint set X must be in one of the directions in 
U. Formally, U can be defined as the range space of dhe j 86. 

If one wishes to use the measurements y to distinguish be- 
tween x and its local neighborhood, then it suffices to observe 
the projection of y onto U. Projecting the measurements onto 
U removes most of the noise, retaining only K independent 
Gaussian components, each having a variance of a 2 . Thus we 
have obtained an intuitive explanation for the bound of Ka 2 in 
Theorem 2. To formally prove this result, we apply Theorem 1 
to the present setting, as follows. 

Proof of Theorem 2: The problem of estimating the 
parameters 6 from a continuous-time signal y(t) of the form 
(6) was examined in [12, Example 1.7.3], where the validity of 
Assumptions P1-P4 was demonstrated. It was further shown 
that the FIM Jg ont for estimating 6 from y(t) is given by [12, 
ibid. ] 

i ( dhe X ( dhe 



Jcont 
e — — 



(36) 



Our goal will be to use (36) and Theorem 1 to obtain a 
bound on estimators of the continuous-time function x(t). 
To this end, observe that the FIM Jg ont is finite since, by 
Assumption P6, the operator dhe/ 86 is a bounded operator 
into L-2- Furthermore, by Assumption P7, dho/dO has a 
trivial null space, and thus Jjj ont is invertible. Therefore, 
Assumption P5 has been demonstrated. We may consequently 
apply Theorem 1, which yields 



MSE(x,x) 



> ai Tr 



dh e 
86 



dh e 
86 



8h e 
86 



8h e 
86 



a 2 c Tr(I K ) 
Ka 2 



(37) 



thus completing the proof. ■ 
To illustrate the use of Theorem 2 in practice, let us consider 
as a simple example a signal x(t) belonging to a finite- 
dimensional subspace Q. Specifically, assume that 



K 



(38) 



k=l 



for some coefficient vector 6 = (ai, . . . ,ax) T and a given 
set of linearly independent functions {g k } spanning Q. This 
includes, for example, families of shift-invariant subspaces 
with a compactly supported generator (see Section III-A). 



From Theorem 2, the MSE of any unbiased estimator of x is 
bounded by Ka 2 , where K is the dimension of the subspace 
Q. We now demonstrate that this bound is achieved by the 
unbiased estimator 

x = Pe y (39) 

where Pg is the orthogonal projector onto the subspace Q. 

To verify that (39) achieves the CRB, let G denote the set 
transformation (2) associated with the functions {gk}k=v One 
may then write x = G6 and Pg = G{G*G)~ 1 G*. Thus (39) 
becomes 

x = G(G*G)- 1 G*G6 + G{G*G)- 1 G*w 
= G6 + P g w (40) 



and therefore 



E{p-x||| 2 }=E{||P e HlU 



(41) 



Since Q is a A'-dimensional subspace, it is spanned by a set 
of A orthonormal 1 functions ui, . . . , uk G £2- Thus 



K 

E{||P<HI! 2 } = E E { l(u; ' Ufc)|2 } 



Kai 



(42) 



which demonstrates that x indeed achieves the CRB in this 
case. 

In practice, a signal is not usually estimated directly from 
its continuous-time measurements. Rather, the signal y(t) 
is typically sampled and digitally manipulated. In the next 
section, we will compare the results of Theorem 2 with 
the performance achievable from sampled measurements, and 
demonstrate that in some cases, a finite-rate sampling scheme 
is sufficient to achieve the continuous-time bound of Theo- 
rem 2. 

VII. CRB for Sampled Measurements 

In this section, we consider the problem of estimating x(t) 
of (5) from a finite number of samples of the process y(t) 
given by (6). Specifically, suppose our measurements are given 
by 

c n = (y, s n ) + v n = / y(t)s* n (t)dt + v n , n=l,...,N 
Jo 

(43) 

where {s„}^ =1 C L2[0,Io] are sampling kernels, and v n is 
a discrete white Gaussian noise process, independent of w(t), 
having mean zero and variance a 2 . Note that the model (43) 
includes both continuous-time noise, which is present in the 
signal y(t) = x(t) + w(t) prior to sampling, and digital noise 
v n , which arises from the sampling process, e.g., as a result of 
quantization. In this section, we will separately examine the 
effect of each of these noise components. 

From (6) and (43), it can be seen that the measurements 
ci, . . . , cn are jointly Gaussian with mean 



fi n = E{c„} = (x, s n ) 



(44) 



We require the new functions u\ , . . . , uk since the functions g\ , . . . , 
are not necessarily orthonormal. The choice of non-orthonormal functions 
91 1 ■ ■ • 1 9k w iH prove useful in the sequel. 
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and covariance 



r, 



Cov(c 4 , Cj) = a\ (si,Sj) + a^8 i: 



(45) 



A somewhat unusual aspect of this estimation setting is that 
the choice of the sampling kernels s n {t) affects not only the 
measurements obtained, but also the statistics of the noise. One 
example of the impact of this fact is the following. Suppose 
first that no digital noise is present, i.e., ad = 0, and consider 
a modified set of sampling kernels {s n (t)}n=i which are an 
invertible linear transformation of {s n (t)}^ =1 , so that 



JV 



(46) 



where B £ M. NxN is an invertible matrix. Then, the resulting 
measurements c are given by c = Be, and similarly the 
original measurements c can be recovered from c. It follows 
that these settings are equivalent in terms of the accuracy with 
which x can be estimated. In particular, the FIM for estimating 
x in the two settings is identical [12, Th. 1.7.2]. 

When digital noise is present in addition to continuous- 
time noise, the sampling schemes {s n {t)} and {s n (t)} are 
no longer necessarily equivalent, since the gain introduced by 
the transformation B will alter the ratio between the energy of 
the signal and the digital noise. The two estimation problems 
are then equivalent if and only if B is a unitary transformation. 

How should one choose the space S = span{si, . . . , sn} 
spanned by the sampling kernels? Suppose for a moment that 
there exist elements in the range space of dhg j 89 which 
are orthogonal to S. This implies that one can perturb x in 
such a way that the constraint set X is not violated, without 
changing the distribution of the measurements c. This situation 
occurs, for example, when the number of measurements N 
is smaller than the dimension K of the parametrization of 
X. While it may still be possible to reconstruct some of the 
information concerning x from these measurements, this is an 
undesirable situation from an estimation point of view. Thus 
we will assume henceforth that 



■Its?) 



(47) 



As an example of the necessity of the condition (47), con- 
sider again the signal (38), which belongs to a A'-dimensional 
subspace Q C £2 spanned by the functions gi, . . . ,gx- In this 
case it is readily seen that for any vector v 



dhg 
89 



K 

E 

fc=i 



Vfc5fc(£). 



(48) 



Since the functions {g k } span the space Q, this implies that 
lZ(dhg/d9) = Q, and therefore the condition (47) can be 
written as 

^05^ = {0} (49) 

which is a standard requirement in the design of a sampling 
system for signals belonging to a subspace Q [36]. 

By virtue of Theorem 1, a lower bound on unbiased 
estimation of x can be obtained by first computing the FIM 
jsamp est j matm g q f rom c This yields the following 



result. For simplicity of notation, in this theorem we assume 
that the function hg and the sampling kernels s n are real. If 
complex sampling kernels are desired (as will be required in 
the sequel), the result below can still be used by translating 
each measurement to an equivalent pair of real-valued samples. 

Theorem 3. Let x be a deterministic real function defined 
by (5), where 9 £ is an unknown deterministic parameter 
and is an open subset ofM. K . Assume regularity conditions 
P6-P7, and let x be an unbiased estimator of x from the 
real measurements c = (c±, . . . , cm) T of (43). Then, the FIM 
for estimating 9 from c is given by 



-rsamp 
J 9 



Tsamp 



dh e 
89 



S (a 2 S* S + a 2 A 



d 1 N 



s* 



dhg 
89 



(50) 



where S is the set transformation corresponding to the func- 
tions {snj^Li- V (47) holds, then Jg amp is invertible. In this 
case, any finite-variance, unbiased estimator x for estimating 
x from c satisfies 



MSE(£,.t) > Tr 



dhg 
89 



8h e 
89 



(51) 



[31] 



/ ysamp \ - 

Proof: In the present setting, the FIM J^ amp is given by 



|Samp 



89 



90 



(52) 



where the matrix T 6 R NxN is defined by (45) and the matrix 
8 fx/ 89 eR NxK is given by 

8fl\ _ 8jin 

de nh r 80 k 



(53) 



with defined in (44). 

By the definition of the set transformation, the ijth element 
of the N x N matrix S*S is given by 



(S*S)ij = (Sej,Sei) = (s,-,Si) 



(54) 



where is the ith column of the N x N identity matrix. 
Therefore, we have 



r = <j 2 s*s + c^Iat. 

Similarly, observe that 



(55) 



S 



* 8hg 

~~89 



8hg „ \ - / Uhl \ _ 
89 Gk ' ben -\d9 k ' Sn ~ 88 k 



(56) 



where is the fcth column of the K x K identity matrix. 
Thus 

d/j, _ 8h e 

89 ~ b ~89~- (5?) 
Substituting (55) and (57) into (52) yields the required expres- 
sion (50). 

We next demonstrate that if (47) holds, then Jg amp is 
invertible. To see this, note that from (50) we have 



(58) 



Now, consider an arbitrary function / £ 7Z(8hg/89). If (47) 
holds, then / is not orthogonal to the subspace S. Therefore, 



LO 



(/> s n) 7^ for at least one value of n, and thus by (3), S* f ^ 
0. This implies that 



Af S 



dh e 
89 



{0}. 



(59) 



Combined with (58), we conclude that A/"(J^ mp ) = {0}. This 
demonstrates that J e amp is invertible, proving Assumption 
P5. Moreover, in the present setting, Assumptions P1-P4 are 
fulfilled for any value of 0q [32]. Applying Theorem 1 yields 
(51) and completes the proof. ■ 
In the following subsections we draw several conclusions 
from Theorem 3. 



A. Discrete-Time Noise 

Suppose first that ct 2 = 0, so that only digital noise 
is present. This setting has been analyzed previously [9], 
[10], and therefore only briefly examine the contrast with 
continuous-time noise. When only digital noise is present, 
its effects can be surmounted either by increasing the gain 
of the sampling kernels, or by increasing the number of 
measurements. These intuitive conclusions can be verified 
from Theorem 3 as follows. Assume that condition (47) holds, 
and consider the modified kernels s n (t) = 2s n (t). The set 
transformation S corresponding to the modified kernels is S = 
2S, and since ct 2 = 0, this implies that the FIM obtained from 
the modified kernels is given by J e amp = 4J e amp . Thus, a 
sufficient increase in the sampling gain can arbitrarily increase 
jsamp an( j conse q Uen tiy re duce the bound (51) arbitrarily close 
to zero. 

Of course, from a practical point of view, increasing the 
gain also increases the likelihood that the sampled signal will 
exceed the dynamic range of the quantizer. It is therefore not 
feasible to arbitrarily increase the sampling gain. As an alter- 
native, it is possible to increase the number of measurements. 
For example, suppose one simply repeats each measurement 
twice. Let S and S denote the transformations corresponding 
to the original and doubled sets of measurements. It can then 
readily be seen from the definition of the set transformation 
(2) and its adjoint (3) that SS* = 4SS*. Consequently, by the 
same argument, in the absence of continuous-time noise one 
can achieve arbitrarily low error by repeated measurements, 
or more generally, by increasing the sampling rate. 



B. Continuous-Time Noise 

As we have seen, sampling noise can be mitigated by 
increasing the sampling rate. Furthermore, digital noise is 
inherently dependent on the sampling scheme being used. 
Since our goal is to determine the fundamental performance 
limits regardless of the sampling technique, we will focus here 
and in subsequent sections on continuous-time noise. Thus, 
suppose that ct 2 . = 0, so that only continuous-time noise is 
present. In this case, as we now show, it is generally impossible 
to achieve arbitrarily low reconstruction error, regardless of 
the sampling kernels used; indeed, it is never possible to 
outperform the continuous-time CRB of Section VI, which 
is typically nonzero. To see this formally, observe first that in 



the absence of digital noise, the FIM for estimating 9 can be 
written as 



Tsamp 



1 (?b° 

2 ^ 89 
1_ / 8h e 
2 V 89 



s(s*sy 1 s* 



8h e 
89 



8h g 
89 



(60) 



where P5 is the orthogonal projection onto the subspace 
S. It is insightful to compare this expression with the FIM 
Jg ont obtained from continuous-time measurements in (36). 
In both cases, a lower bound on the MSE for unbiased 
estimation of x was obtained from Je by applying Theorem 1 . 
Consequently, if it happens that Jg ont = J e amp , then the 
continuous-time bound of Theorem 2 and the sampled bound 
of Theorem 3 coincide. Thus, if no digital noise is added, then 
it is possible (at least in terms of the performance bounds) that 
estimators based on the samples c will suffer no degradation 
compared with the "ideal" estimator based on the entire set 
of continuous-time measurements. This occurs if and only if 
7Z(8he/89) C S; in this case, the projection P5 will have 
no effect on the FIM J e amp , which will then coincide with 
jcont Q f (26), In the remainder of this section, we will discuss 
several cases in which this fortunate circumstance arises. 



C. Example: Sampling in a Subspace 

The simplest situation in which samples provide all of the 
information present in the continuous-time signal is the case 
in which x(t) belongs to a A'-dimensional subspace Q of L2. 
This is the case, for example, when the signal lies in a shift- 
invariant subspace having a compactly supported generator 
(see Section III-A). As we have seen above (cf. (48)), in this 
scenario dhe j 89 is a mapping onto the subspace Q. Assuming 
that there is no discrete-time noise, it follows from (60) that 
the optimal choice of a sampling space S is Q itself. Such a 
choice requires N = K samples and yields Jg ont = J e amp . 
Of course, such an occurrence is not possible if the sampling 
process contributes additional noise to the measurements. 

In some cases, it may be difficult to implement a set of 
sampling kernels spanning the subspace Q. It may then be 
desirable to choose a A'-dimensional subspace S which is 
close to Q but does not equal it. We will now compute the 
CRB for this setting and demonstrate that it can be achieved 
by a practical estimation technique. This will also demonstrate 
achievability of the CRB in the special case S — Q. We first 
note from (2) and (48) that dhg/89 = G, where G is the 
set transformation corresponding to the generators {gk\k=i- 
Furthermore, it follows from (49) that S*G and G*S are 
invertible K x K matrices [36]. Using Theorem 3, we thus 
find that the CRB is given by 



MSE(x, x) > ct 2 Tr(G (G*S(S*S)~ 1 S*G) 1 G* 



a'i Tr(G(S*G)- 1 S*S(G*S)~ 1 G*^ 



(61) 



It is readily seen that when S = Q, the bound (61) reduces 
to Act 2 , which is (as expected) the continuous-time bound 
of Theorem 2. When S / 5, the bound (61) will generally 
be higher than Act 2 , since J e amp of (60) will exceed Jjj ont of 
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(36). In this case, it is common to use the consistent, unbiased 
estimator [15], [36] 

x = G(S*G)- 1 c. (62) 

As we now show, the bound (61) is achieved by this estimator. 
Indeed, observe that c = S*y = S*G0 + S*w, and thus 

E{\\x-x\\l 2 }=E{\\G(S*G)- 1 S*iv\\l 2 } 

= E{Tr(G{S*G)- 1 S*ww* S(G* S^G*)} 

= Tr(G(S*G)- 1 Cov(S*w)(G*S)- 1 G*) . 

(63) 

Note that Cov(5*w;) = Cov(c), which by (55) is equal to 
a^S*S. Substituting this result into (63) and comparing with 
(61) verifies that x achieves the CRB. 

D. Nyquist-Equivalent Sampling 

We refer to situations in which the dimension of the 
sampling space equals the dimension of the signal space 
as "Nyquist-equivalent" sampling schemes. In the previous 
section, we saw that Nyquist-equivalent sampling is possible 
using K samples when the signal lies in a A'-dimensional sub- 
space X, and that the resulting system achieves the continuous- 
time CRB. A similar situation occurs when the set of possible 
signals X is a subset of an Af -dimensional subspace Ai of 
L2 with M > K. In this case, it can be readily shown that 
K{dh g /dO) C M. Thus, by choosing N = M sampling 
kernels such that S = A4, we again achieve Jg ont = Jg amp , 
demonstrating that all of the information content in x has been 
captured by the samples. This is again a Nyquist-equivalent 
scheme, but the number of samples it requires is higher than 
the number of parameters K defining the signals. Therefore, 
in this case it is not possible to sample at the rate of innovation 
without losing some of the information content of the signal. 

In general, the constraint set X will not be contained in 
any finite-dimensional subspace of L2. In such cases, it will 
generally not be possible to achieve the performance of the 
continuous-time bound using any finite number of samples, 
even in the absence of digital noise. This implies that in the 
most general setting, sampling above the rate of innovation can 
often improve the performance of estimation schemes. This 
conclusion will be verified by simulation in Section IX. 

VIII. Optimal Sampling Kernels: A Bayesian 
Viewpoint 

In this section, we address the problem of designing a sam- 
pling method which minimizes the MSE. One route towards 
this goal could be to minimize the sampled CRB of Theorem 3 
with respect to the sampling space S. However, the CRB is 
a function of the unknown parameter vector 6. Consequently, 
for each value of 6, there may be a different sampling space 

5 which minimizes the bound. To obtain a sampling method 
which is optimal on average over all possible choices of 6, we 
now make the additional assumption that the parameter vector 

6 is random and has a known distribution. Our goal, then, is 
to determine the sampling space S that minimizes the MSE 

— x\\ 2 L } within a class of allowed estimators. Note that 



the mean is now taken over realizations of both the noise w(t) 
and the parameter 6. 

Since 8 is random, the signal x(t) is random as well. 
To make our discussion general, we will derive the optimal 
sampling functions for estimating a general random process 
x(t) (not necessarily having realizations in X of (5)) from 
samples of the noisy process y(t) = x(t) + w(t). We will 
then specialize the results to several specific types of FRI sig- 
nals and obtain explicit expressions for the optimal sampling 
kernels in these scenarios. 

Let x(t) denote a zero-mean random process defined over 
t <G [0, To], and suppose that its autocorrelation function 

Rx{t,rj) ±E{x(t)x*(ri)} (64) 

is continuous in t and 77. Our goal is to estimate x(t) based 
on a finite number N of samples of the signal y(t) = x(t) + 
w(t), t G [0, To], where w(t) is a white noise process (not 
necessarily Gaussian) with variance a 2 c which is uncorrected 
with x(t). We focus our attention on linear sampling schemes, 
i.e., we assume the samples are given by 

c n = (y, s n ) . (65) 

Finally, we restrict the discussion to linear estimation meth- 
ods, namely those techniques in which the estimate £{t) is 
constructed as 

N 

x(t) = 2J c n v n (t), (66) 

n—l 

for some set of reconstruction functions {v n (t)}^ =x . It is 
important to note that for any given set of sampling functions 
{s n (t)}%=i, the minimum MSE (MMSE) estimator of x(t) 
is often a nonlinear function of the measurements {c n }^ =1 . 
Indeed, typical FRI reconstruction techniques involve a non- 
linear stage. Consequently, restricting the discussion to linear 
recovery schemes may seem inadequate. However, this choice 
has two advantages. First, as we will see, the optimal linear 
scheme is determined only by the second-order statistics of 
x(t) and w(t), whereas the analysis of nonlinear methods 
necessitates exact knowledge of their entire distribution func- 
tions. Second, it is not the final estimate x(t) that interests 
us in this discussion, but merely the set of optimal sampling 
functions. Once such a set is determined, albeit from a linear 
recovery perspective, it can be used in conjunction with exist- 
ing nonlinear FRI techniques. As we will see in Section IX, 
the conclusions obtained through our analysis appear to apply 
to FRI techniques in general. Under the above assumptions, 
our goal is to design the sampling kernels {s n (t)}^ =1 and 
reconstruction functions {v n (t)}n=i sucn mat me MSE (8) is 
minimized. 

As can be seen from (65), we assume henceforth that only 
continuous-time noise is present in the sampling system. The 
situation is considerably more complicated in the presence of 
digital noise. First, without digital noise, one must choose 
only the subspace spanned by the sampling kernels, as the 
kernels themselves do not affect the performance; this is 
no longer the case when digital noise is added. Second, 
digital noise may give rise to a requirement that a particular 
measurement be repeated in order to average out the noise. 
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This is undesirable in the continuous noise regime, since 
the repeated measurement will contain the exact same noise 
realization. 

A. Relation to the Karhunen-Loeve Expansion and Finite- 
Dimensional Generalizations 

The problem posed above is closely related to the 
Karhunen-Loeve transform (KLT) [37], [38], which is con- 
cerned with the reconstruction of a random signal x(t) from 
its noiseless samples. Specifically, one may express x(t) in 
terms of a complete orthonormal basis {ipk(t)}'^ 1 for L2 as 



s (t) = 5^<x,^ fc )^fc(*)- 



(67) 



k=l 



The goal of the KLT is to choose the functions {V'fcWIfcLi 
such that the MSE resulting from the truncation of this series 
after N terms is minimal. It is well known that the solution 
to this problem is given by the TV-term truncation of the 
Karhunen-Loeve expansion [37], [39]. 

Since Rx(t, v) is assumed to be continuous in our setting, 
by Mercer's theorem [39] it possesses a discrete set of eigen- 
functions {V'feWlfeLii which constitute an orthonormal basis 
for Li. These functions satisfy the equations 

AfcVfc(*)= / Rx(t,v)Mv)dv, (68) 
Jo 

in which the corresponding eigenvalues Ai > A2 > • ■ ■ > 
are nonnegative and are assumed to be arranged in descending 
order. With these functions, (67) is known as the Karhunen- 
Loeve expansion. It can be easily shown that the first N terms 
in this series constitute the best TV-term approximation of x(t) 
in an MSE sense [39]. In other words, in the noiseless case, 
the optimal sampling and reconstruction functions are s„ (t) = 
v n (t) = i'n(t)- 

In our setting, we do not have access to samples of x(t) 
but rather only to samples of the noisy process y(t). In this 
case, it is not clear a priori whether the optimal sampling 
and reconstruction filters coincide or whether they match the 
Karhunen-Loeve expansion of x{t). 

The finite-dimensional analogue of our problem, in which 
x, y, and w are random vectors taking values in C M , was 
treated in [40], [41]. The derivation in these works, however, 
relied on the low-rank approximation property of the singular- 
value decomposition (SVD) of a matrix. The generalization 
of this concept to infinite-dimensional operators is subtle and 
will thus be avoided here. Instead, we provide a conceptually 
simple (if slightly cumbersome) derivation of the optimal 
linear sampling and reconstruction method for noisy signals. 
As we will see, it still holds that s n (t) = i/j„(t), but 
v n (t) = otn4>n{t), where a„ is a shrinkage factor depending 
on the SNR of the nth sample. 

B. Optimal Sampling in Noisy Settings 

As explained in Section VII, in the absence of discrete-time 
noise, the MSE is not affected by modifications of the sam- 
pling kernels which leave the set S — span{si(i), . . . , sjv(t)} 



unchanged. Thus, without loss of generality, we constrain 
{ s «W}n=i to satisfy 



s n ,v c s m +J Rx(-,T)s m (T)dr J = 8 m . n (69) 



for every m,n = 1, . . . , N. This can always be done since 
the operator Ry : £2 — > L-2 defined by = 
Jq Rx(t, T)f{r)dT + (Tcf(t) is positive definite. This choice 
is particularly convenient as it results in a set of uncorrected 
samples {c n }. Indeed 



E{«;}=eJ|| s* m (T) y (T)drj ^ s* n (r,)y(r,)d V J 
s* m (T)E{y(T)y*{r])} s n (r))dTdr) 
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s rn( T ) R x(T,i])s n (r))dTdr] 

+ E{(s n ,w) (s m ,w)*} 
s * n ( T ) R x(v, T)s n (r))dTdr) + a\ (s„, s m ) 



(70) 



We are now ready to determine the optimal sampling 
method. We begin by expressing the MSE (8) as 



E\\x(t)-x(t)\ 2 \dt= / E{\x(t)\ 2 }dt 
' Jo 

-2 f n{E{x*{t)x(t)}}dt+ f E{\x{t)\ 2 } dt. 

(71) 



The first term in this expression does not depend on the choice 
of {s n (t)}n=i an d { v n(t)}n=i> an< ^ ^ s therefore irrelevant for 
our purpose. Substituting (66) and (65), and using the fact that 
w(t) is uncorrelated with x(t), the second term can be written 

as 

jf 2»|E|x*(t)5^c n «„(*)||(ft 

N T 

= J2 2 MMx*(t)(x(T)+w(T))}s* n (T)v n (t)}dTdt 

n=i JJ n 

N T 

= Y. 2 n{s* n {T)R x (r,t)v n (t)}dTdt 



S2»j/«n,j£ i? x (-,r)s n (r)dr\| 



(72) 



Similarly, using the fact that {c„}„ =1 are uncorrelated and 
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have unit variance (see (70)), the last term in (71) becomes 



E< 



iY 



y~l c n v n (t) 



n=l 



N N 



dt = E E E { c m c «} { v m, V„) 



m — 1 n— 1 

N 

Eiwi 



(73) 



Substituting (72) and (73) back into (71), we conclude that 
minimization of the MSE is equivalent to minimization of 

(\\v„\\ 2 -2$lUv n ,^ R X (;T)s n (T)dT^jj (74) 

with respect to {s n (t)}^ =1 and {v n (t)}^ =1 , subject to the set 
of constraints (69). 

As a first stage, we minimize (74) with respect to the 
reconstruction functions {v n (t)}^ =1 . To this end, we note that 
the nth summand in (74) is lower bounded by 



-2|H 



Rx(-,T)s n (T)d,T 



> 



Rx(-,T)s n (T)dT 



(75) 



where we used the Cauchy-Schwarz inequality and the fact 
that mm z {z 2 — 2bz} = —b 2 . This bound is achieved by 
choosing 



v n (t) 



Rx(t,T)s n (r)dT, 



(76) 



thus identifying the optimal reconstruction functions. 

Substituting (76) into (74), our goal becomes to maximize 

2 



N 

E 



Rx(-,T)s n (T)dT 



(77) 



with respect to the sampling functions {s n (t)}^ =1 . As we 
show in Appendix B, the maximum of this expression is 
achieved by any set of kernels of the form 



N 



(78) 



fc=i 



where A is a unitary N x N matrix and Afc and ipk (t) are the 
eigenvalues and eigenfunctions of R x (t,T]) respectively (see 
(68)). In particular, we can choose A = Ijv, leading to 

Sn (t) = —L=^ n {t), n=l,...,N. (79) 
y/X n + o* 

From (76), the optimal reconstruction kernels are given by 

An 



V n (t) 



:il> n (t), n = l,...,N. 



The following theorem summarizes the result. 



(80) 



Theorem 4. Let x(t), t 6 [0, T] be a random process whose 
autocorrelation function Rx(t,rj) is jointly continuous in t 
and r\. Assume that y(t) = x{t) + w(t), where w(t) is a 



white noise process uncorrelated with x(t). Then, among all 
estimates x(t) of x(t) having the form 



x(t) 



N 
n=l 



v n{t) / s* n {T)y(r)dt 



(81) 



the MSE (8) is minimized with {s n (t)}n=i an d {v n {t)}n=i °f 
(79) and (80) respectively. In these expressions, X n and tp n (t) 
are the eigenvalues and eigenfunctions ofR x (t, if) respectively 
(see (68)). 

Interestingly, the optimal sampling and reconstruction func- 
tions in our noisy setting are similar to those dictated by the 
KIT". The only difference is that in the present scenario, the 
nth sample is shrunk by a factor of A„/(A„ + a 2 ) prior to 
reconstruction. This ensures that the low-SNR measurements 
do not contribute to the recovery as much as their high-SNR 
counterparts. From the viewpoint of designing the sampling 
mechanism, however, this difference is of no importance. 

As stated above, in practice one would generally favor non- 
linear processing of the samples (namely, applying standard 
nonlinear FRI techniques) rather than a simple element-wise 
shrinkage. Thus, the importance of Theorem 4 for our purposes 
is in identifying that the eigenfunctions of Rx(t, r) remain the 
optimal sampling kernels even in the noisy setting. 

C. Example: Sampling in a Subspace 

To demonstrate the utility of Theorem 4, we now revisit the 
situation in which x{t) is given by (38) for some set of lin- 
early independent functions {gk(t)}f =1 spanning a subspace 
Q € L2- We assume that the coefficients = {a\ : . . . ,ojf} T 
form a zero-mean random vector and denote its autocorrela- 
tion matrix by Rg. In this case, the signal's autocorrelation 
function is given by 



Rx{t,r,)=E{x(t)x*{r,)} 



K 



K 



{k=i t=i ) 

K K 

EE^W.9£*(«)(Re)M- (82) 



fc=i t=i 



Consequently, the operator Rx ■ Li — > L2 defined by 

(R x h)(t) = J Q R x (t,i])h(i])dri can be expressed as 



Rx — GUgG* , 



(83) 



where G is the set transformation (2) associated with {gk}k=v 
Now, let U be a unitary matrix and let D be a diagonal 
matrix, such that 



UDU* = (G*G) 1/2 K e (G*G) 1 



/2 



(84) 



Since the dimension of TZ(G) is K, the operator Rx has at 
most K nonzero eigenvalues {\k]u=i- Let * denote the set 
transformation associated with the N eigenfunctions {ip n }n=i 
corresponding to the N largest eigenvalues, for some N < K. 
Then, it can be shown that 



* = G(G*Gy 1/2 V 



(85) 
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and the corresponding eigenvalues are 

A n = D„ )fl . (86) 

To see this, note that according to (85), is an isometry, since 

= U*{G*G)- 1/2 G*G{G*G)- 1/2 XJ = U*U = l K . 

(87) 

Furthermore, (83) and (84) imply that R x = *D**. Conse- 
quently 

Rx^ = ^Dtf** = I'D, (88) 

which proves the claim. 

It is important to emphasize that the K functions 
{4>n{t)} n=i s P an S- Therefore, if one is allowed to take 
N = K samples, then the optimal choice is a set of kernels 
that span Q. This conclusion is compatible with the CRB 
analysis of the previous sections. However, the advantage 
of the Bayesian viewpoint is that it allows us to identify 
the optimal sampling space when less than K samples are 
allowed. For example, suppose that {g n (t)} are orthonormal, 
and the coefficients {a n } are uncorrected. Then the optimal 
sampling space is the one spanned by the N functions {g n (t)} 
corresponding to the N largest-variance coefficients {a n }. 

A second example demonstrating the derivation of the 
optimal sampling kernels will be given in the next section. 

IX. Application: Channel Estimation 

In this section, we focus on a specific application of FRI 
signals, namely, that of estimating a signal consisting of a 
number of pulses having unknown positions and amplitudes 
[5]-[7]. More precisely, we consider periodic signals x(t) of 
the form (16), which were discussed in Section III-C. These 
are T-periodic pulse sequences, in which the pulse shape 
g(t) is known, but the amplitudes {a?} and delays {t?} are 
unknown. After analyzing periodic signals of this type, we 
will also compare estimation performance in this case with 
the semi-periodic family (17), and attempt to explain the 
empirically observed differences in stability between these two 
cases. 

By defining the T-periodic function h(t) = J^nezdi^ ~ 
nT), we can write x(t) of (16) as 

L 

x(t)=J2 a e h (t-ti)- (89) 
e=i 

Our goal is now to estimate x(t) from samples of the noisy 
process y(t) of (6). As before, we will assume that only 
continuous-time noise is present in the system. Since x{t) is 
T-periodic, it suffices to recover the signal in the region [0, T}. 
In particular, we would like to identify the optimal sampling 
kernels for this setting, and to compare existing algorithms 
with the resulting CRB in order to determine when the optimal 
estimation performance is achieved. 
Let 

h k = 7F {h,(p k ), keZ (90) 



be the Fourier series of h(t), where ip^it) — ei 2nkt ' T . The 
Fourier series of x(t) is then given by 



x k 4 -(x,<p k ) =h k Y. a t e ~ j¥ktt > fcGZ - < 91 ) 



Let K, — {k G Z : hk ^ 0} denote the indices of the 
nonzero Fourier coefficients of h(t). Suppose for a moment 
that JC is finite. It then follows from (91) that x(t) also has 
a finite number of nonzero Fourier coefficients. Consequently, 
the set X of possible signals x(t) is contained in the finite- 
dimensional subspace M = sp&n{ipk}ke>c- Therefore, as 
explained in Section VII-D, choosing the N = \JC\ sampling 
kernels {s n (t) = e~-' 27mt < /T }„ e jt results in a sampled CRB 
which is equivalent to the continuous-time bound. This result 
is compatible with recent work demonstrating successful per- 
formance of FRI recovery algorithms using exponentials as 
sampling kernels [6]. 

Note, however, that this is a "Nyquist-equivalent" sampling 
scheme, i.e., the number of samples required N = \JC\ is 
potentially much higher than the number of degrees of freedom 
2L in the signal x(t) (see Section VII-D). This provides 
a theoretical explanation of the empirically recognized fact 
that sampling above the rate of innovation improves the 
performance of FRI techniques in the presence of noise [7], 
a fact which stands in contrast to the noise-free performance 
guarantees of many FRI algorithms. 

Moreover, if there exists an infinite number of nonzero 
coefficients hk, then in general the set X will not belong to 
any finite-dimensional subspace. Consequently, it will not be 
possible in this case for an algorithm based on a finite number 
of samples to achieve the performance obtainable from the 
complete signal y(t). This occurs, for example, whenever the 
pulse g(t) of (16) is time-limited. In such cases, any increase in 
the sampling rate will potentially continue to reduce the CRB, 
although the sampled CRB will converge to the asymptotic 
value of PtqO' 2 in the limit as the sampling rate increases. 



A. Choosing the Sampling Kernels 

An important question in the current setting is how to 
choose the sampling kernels so as to achieve the best possible 
performance under a limited budget of samples. This can 
be done via the Bayesian analysis provided in Section VIII. 
Assume, for example, that the time delays {ti}f =1 are inde- 
pendently drawn from a uniform distribution over the interval 
[0,T]. Furthermore, suppose that the amplitudes {di}f =1 are 
mutually uncorrected zero-mean random variables which are 
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independent of the time delays and have variance a 2 . Then, 

Rx{t,T) = E{x(t)x* (t)} 



L L 



^ ^ E{a k a* t }E{h(t - t k )h*(r - t e )} 

k=l i=l 
L 



1=1 



E{h(t-t t )h*(T-t e )} 



1 



T 



a 2 a L- / h(t-tt)h*(T-tt)dt t 



a 2 a LY,\h\ 2 e 



j^f-k(t-r) 



(92) 



where we used Parseval's theorem. It is easily verified that the 
eigenfunctions of Rx(t,r) are given by 

1 .-ai. 



T 



n G Z 



and the corresponding eigenvalues are 



A„ = La 2 a T\h n \ 2 , neZ. 



Therefore, the optimal set of N sampling functions is 



1,...,N 



(93) 



(94) 



(95) 



where p n is the index of the nth largest Fourier coefficient 
\h Pn \. The optimal linear recovery of x(t) from the resulting 
samples is given by 



E< 



L*Z\~h P J 2 



x(t) = > &.„ 



(96) 



The performance of this estimator is poorer than state-of-the- 
art techniques, due to the restriction to linear reconstruction 
schemes. We recall that this technique is intended only for 
selecting the sampling kernels. 

The above analysis again lends credence to the recently 
proposed time-delay estimation technique of Gedalyahu et al. 
[6], which makes use of complex exponentials as sampling 
functions. A disadvantage of this algorithm is that it can 
only handle a set of exponents with successive frequencies, 
while for general pulses, the indices of the N largest Fourier 
coefficients may be sporadic. As we will see in Section IX-C, 
this limitation may result in deteriorated performance of the 
algorithm in some cases. 

B. Computing the CRB 

Having identified the optimal sampling kernels (95), we 
would now like to compute the CRB for estimating x(t) from 
the resulting samples. In order to compare these results with 
the continuous-time CRB, we assume that no digital noise 
is added in the sampling process. However, the calculations 
described below can be adapted without difficulty to situations 
containing both continuous-time and digital noise. 

We assume for simplicity that h(t) and {ai} are real-valued. 
Nonetheless, the sampling kernels chosen above are complex- 
valued, implying that Theorem 3 cannot be directly applied. 
Yet since h(t) is real-valued, we have \hk\ = \h-k\, and 
consequently the optimal sampling kernels consist of complex 



conjugate pairs e ±J 27r "*/' r Recall that the sampling kernels 
can be changed without affecting the CRB, as long as the sub- 
space they span remains constant. Consequently, the CRB can 
be computed for the equivalent sampling kernels sin(27mi/T) 
and cos(27T7?i/T), which are real and can therefore be used in 
conjunction with the results of Section VII. We note that since 
the transition to these real-valued kernels is unitary, the CRB 
will not change even if digital noise is added. To be specific 
without complicating the notation, we assume that N is odd 
and that the DC component is included among the sampling 
kernels chosen in (95). We can then define the equivalent set 
of kernels 



s (t) = 1, 

s n (t) = cos(2np n t/T), n=l,..., 



N - 1 



2 ' 

5„+«+i(i) - sin(27rp„t/T), n = l,...,—^. (97) 



We further define the parameter vector 

= (ai, . . . , cll, ti, . . . , £_l) t 



(98) 



whose length is K = 2L. 

Theorem 3 provides a two-step process for computing the 
CRB of the signal x(t) from its samples. First, the FIM Jg amp 
for estimating 8 is determined. Second, the formula (51) is 
applied to compute the CRB. While Theorem 3 also provides 
a means for calculating Jg amp , it is more convenient in the 
present setting to derive the FIM directly. This can be done 
by calculating the expectations \i n of (44) and applying (52). 



In our setting, \i % 



\X. s Tl 



are given by 



fin 



n = 0,..., 



N - 1 
N - 1 



V n+ M±i = Xpn 2 * Pn , n = l,...,-- T 1 (99) 

where {x n } are the Fourier coefficients of x(t). These coeffi- 
cients depend in turn on the parameter vector 6, as shown in 
(91). Substituting \i n into (52) yields a closed-form expression 
for J e . Since the resulting formula is cumbersome and not 
very insightful, it is not explicitly written herein. 

To obtain the sampled CRB, our next step is to compute 
the 2L x 2L matrix 



M 



dh e 
80 



dhe 
dO 



(100) 



The function hg : M. 2L — > L-2 maps a given parameter vector 
to the resulting signal x(t) as defined by (89). Differentiating 
this function with respect to 0, we find that the operator 
dh / 36 ■ » 2L -> L 2 is defined by 

Oh \ 

-j± J v = vih(t - h) + ■ ■ ■ + v L h(t - t L ) 

- v L+1 aih'(t - h) v 2 LaLh'(t - t L ) 

(101) 



for any vector v G 



])2L 
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One may now compute the ifcth element of M as 



ik 



dh e 
89 



dhe 
89 



efc 



dhe dhe 
~d9 e *'~d9 



■e/ v 



(102) 



Thus, each element of M is an inner product between two of 
the terms in (101). To calculate this inner product numerically 
for a given function h(t), it is more convenient to use 
Parseval's theorem in order to convert the (continuous-time) 
inner product to a sum over Fourier coefficients. For example, 
in the case 1 < i, k < L, we have 

M lk = { h{t - t t )h* (t - t k )dt 



= Tj2Ke- l2nUn/T h* n e^ n / T 

raGZ 

= T^|/i n | 2 e- j27rfo -^"/ T , l<i,jfc<L. (103) 

TiGZ 

An analogous derivation can be carried out when i or k are 
in the complementary range L + 1, . . . , 2L. 

Finally, having calculated the matrices J^ amp and M, the 
CRB for sampled measurements is obtained using (51). We 
are now ready to compare this bound to the performance of 
practical estimators in some specific scenarios. 

C. Effect of the Pulse Shape 

In Fig. 1, we document several experiments comparing the 
CRB with the time-delay estimation technique of Gedalyahu 
et al. [6]. Specifically, we sampled the signal x(t) of (16) 
using a set of exponential kernels, and used the matrix pencil 
method [42] to estimate x(t) from the resulting measurements. 
Since we are considering only continuous-time noise, applying 
an invertible linear transformation to the sampling kernels has 
no effect on our performance bounds (see Section VII). The 
various kernels suggested in [6] amount to precisely such an 
invertible linear transformation, and the same performance 
bound applies to all of these approaches. Moreover, under 
the continuous-time noise model, it can be shown that these 
techniques also exhibit the same performance. For the same 
reason, the performance reported here is also identical to the 
method of Vetterli et al. [2]. 

In our experiments, a signal containing L = 2 pulses was 
constructed. The delays and amplitudes of the pulses were 
chosen randomly and are given by 



<2l 

0,2 



0.3204, 
0.6063, 



t-2 



0.6678, 
0.9863. 



(104) 



Modifications of these parameters does not appear to sig- 
nificantly affect the reported results, except when the time 
delays are close to one another, a situation which will be 
discussed in depth in Section IX-D. The pulse h(t) consisted 
of \JC\ = 401 nonzero Fourier coefficients at positions K, = 
{-200, . . . , 200}. The CRB is plotted as a function of the 
number of samples N, where the sampling kernels are given 
by s n (t) = e ^ nt / T with n e {- [N/2\ , . . . , [N/2\ }. This is 
done because the matrix pencil method requires the sampling 
kernels to have contiguous frequencies. 
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(a) The pulse g(t) is a filtered Dirac with 401 Fourier coefficients. 
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(b) The pulse g(t) contains 401 nonzero Fourier coefficients which decrease 
monotonically with the frequency. 
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(c) The pulse g(t) is a filtered rect(-) with 401 Fourier coefficients. 

Fig. 1 . Comparison of the CRB and the performance of a practical estimator, 
as a function of the number of samples. 
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In Fig. 1(a), we chose h k = 1 for -200 < k < 200 and 
hk = elsewhere; these are the low-frequency components 
of a Dirac delta function. The noise standard deviation was 
a c = 10~ 5 . In this case, for a fixed budget of TV samples, 
any choice of TV exponentials having frequencies in the range 
—200 < k < 200 is optimal according to the criterion of 
Section VIII. As expected, the sampled CRB achieves the 
continuous-time bound Ku\ when TV > |/C|. However, the 
CRB obtained at low sampling rates is higher by several orders 
of magnitude than the continuous-time limit. This indicates 
that the maxim of FRI theory, whereby sampling at the rate 
of innovation suffices for reconstruction, may not always hold 
in the presence of mild levels of noise. Indeed, if no noise 
is added in the present setting, then perfect recovery can be 
guaranteed using as few as TV = 4 samples; yet even in the 
presence of mild noise, our bounds demonstrate that perfor- 
mance is quite poor unless the number of samples is increased 
substantially. This result may provide an explanation for the 
previously observed numerical instability of FRI techniques 
[2], [7]. 

As a further observation, we note that in this scenario, 
existing algorithms come very close to the CRB. Thus, the 
previously observed improvements achieved by oversampling 
are a result of fundamental limitations of low-rate sampling, 
rather than drawbacks of the specific technique used. 

The same experiment is repeated in Fig. 1(b) with a pulse 
having Fourier coefficients hk = 1/(1 + O.Olfc 2 ). Since the 
Fourier coefficients decrease with \k\, in this case our choice 
of low-frequency sampling kernels is optimal. However, the 
SNR of the measurements c„ decreases with n. As can be 
seen, this has a negative effect on the performance of the 
algorithm, which is not designed for high noise levels. Indeed, 
including low-SNR measurements causes the MSE not only 
to depart from the CRB, but eventually even to increase as 
more noisy samples are provided. In other words, one would 
do better to ignore the high-frequency measurements than to 
feed them to the recovery algorithm. Yet information is clearly 
present in these high-frequency samples, as indicated by the 
continual decrease of the CRB with increasing TV. Thus, our 
analysis indicates that improved estimation techniques should 
be achievable in this case, in particular by careful utilization 
of low-SNR measurements. 

The adverse effect of low-SNR measurements is exacerbated 
if, for a given TV, one does not choose the TV largest Fourier 
coefficients. This is demonstrated in Fig. 1(c). Here, the 
results of a similar experiment are plotted, in which hk = 
Psmc(nP/T), -200 < k < 200. These are the 401 lowest- 
frequency Fourier coefficients of a rectangular pulse having 
width P. In this case, the Fourier coefficients are no longer 
monotonically decreasing with \k\. Consequently, the sampling 
kernels s n (t) = e j2nnt / T with n G {- [TV/2J , . . . , [N/2\} 
do not correspond to the TV largest Fourier coefficients, and 
thus are not optimal. In particular, for the chosen parameters, 
1^25 1 = | ft— 25 1 are considerably smaller than the rest of the 
coefficients. When TV > 50, the corresponding measurements 
are included, causing the MSE to deteriorate significantly. 
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Fig. 2. Comparison between the CRB and the performance of a practical 
estimator as a function of the pulse positions. The signal contains L = 2 
pulses, the first of which is located at ii = 0.5. The MSE is plotted as a 
function of the position of the second pulse. 




(a) The spacing between the pulses is 0.04. 




(b) The spacing between the pulses is 0.01. 
Fig. 3. Demonstration of the different levels of overlap between pulses. 

D. Closely-Spaces Pulses 

It is well-known that the estimation of pulse positions 
becomes ill-conditioned when several of the pulses are located 
close to one another. Intuitively, this is a consequence of the 
overlap between the pulses, which makes it more difficult to 
identify the precise location of each pulse. However, our goal 
is to estimate the signal x(t) itself, rather than the positions 
of its constituent pulses. As we will see, for this purpose the 
effect of closely-spaced pulses is less clear-cut. 

To study the effect of pulse position on the estimation error, 
we used a setup similar to the one of Fig. 1(b), with the 
following differences. First, a higher noise level of a c = 10 -3 
was chosen. Second, the signal consisted of L = 2 pulses, 
with the first pulse at position t\ = 0.5. The position of the 
second pulse was varied in the range [0.3, 0.7] to demonstrate 
the effect of pulse proximity on the performance. The setting 
was otherwise identical to that of Section IX-C. In particular, 
recall that T = 1. 

The results of this experiment are plotted in Fig. 2, which 
documents both the values of the sampled CRB and the actual 
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MSE obtained by the estimator of Gedalyahu et al. [6]. The 
continuous-time CRB is also plotted, although, as is evident 
from Theorem 2, this bound is a function only of the number of 
parameters determining the signal, and is therefore unaffected 
by the proximity of the pulses. 

Several different effects are visible in Fig. 2. First, as the two 
pulses begin to come closer, both the CRB and the observed 
MSE increase by several orders of magnitude; this occurs 
when \ti — *2 1 is between about 0.15 and 0.03. (Of course, 
the precise distances at which these effects occur depend 
on the pulse width and other parameters of the experiment.) 
This level of proximity is demonstrated in Fig. 3(a). At this 
stage, the overlap between the pulses is sufficient to make it 
more difficult to estimate their positions accurately, but the 
separation between the pulses is still large, so that they are 
not mistaken for a single pulse. 

As the pulses draw nearer each other, they begin to resemble 
a single pulse located at (t\ +^)/2 (see Fig. 3(b)). Depending 
on the noise level, at some point the estimation algorithm will 
indeed identify the two pulses as one. Since our goal is to 
estimate x(t) and not the pulse positions, such an "error" 
causes little deterioration in MSE. This is visible in Fig. 2 
as the region in which the MSE of the practical algorithm 
ceases to deteriorate and ultimately decreases. 

Interestingly, the CRB does not capture this improvement in 
performance. This failure is due to the fact that the CRB ap- 
plies only to unbiased estimators, while the strategy utilized in 
[6] becomes biased for closely-spaced pulses. For an estimator 
to be unbiased, it is required that the mean estimate, averaged 
over noise realizations, will converge to the true value of x(t), 
which has a form similar to that of Fig. 3(b). The expectation 
of an estimator reconstructing a single pulse will not have the 
form of two closely-spaced pulses; such an estimator is thus 
necessarily biased. In other words, the discrepancy observed 
here results from the fact that in this case, biased techniques 
outperform the best unbiased approach. 

E. Non-Periodic and Semi-Periodic Signal Models 

As we have seen above, the reconstruction of signals of the 
form (16) in the presence of noise is often severely hampered 
when sampled at or slightly above the rate of innovation. 
Rather than indicating a lack of appropriate algorithms, in 
many cases this phenomenon results from fundamental limits 
on the ability to recover such signals from noisy measure- 
ments. A similar effect was demonstrated [7] in the non- 
periodic (or finite) pulse stream model (14). In fact, if one is 
allowed to sample a non-periodic pulse stream with arbitrary 
sampling kernels, then by designing kernels having sufficiently 
large time-domain support, one can capture all or most of the 
energy in the signal. This setting then essentially becomes 
equivalent to a periodic signal model (16) in which the period 
is larger than the effective support of the pulse stream: One can 
imagine that the signal repeats itself beyond the sampled re- 
gion, as this would not affect the measurements. Consequently, 
it is not surprising that the non-periodic model demonstrates 
substantial improvement in the presence of oversampling [7]. 

On the other hand, some types of FRI and union of subspace 
signals exhibit remarkable noise resilience, and do not appear 
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Fig. 4. Comparison between the CRB for a periodic signal (16) and a semi- 
periodic signal (17). 



to require substantial oversampling in the presence of noise 
[5], [24]. As we now show, the CRB can be used to verify that 
such phenomena arise from a fundamental difference between 
families of FRI signals. 

As an example, we compare the CRB for reconstructing the 
periodic signal (16) with the semi-periodic signal (17). Recall 
that in the former case, each period consists of pulses having 
unknown amplitudes and time shifts. By contrast, in the latter 
signal, the time delays are identical throughout all periods, but 
the amplitudes can change from one period to the next. 

While these are clearly different types of signals, an ef- 
fort was made to form a fair comparison between the re- 
construction capabilities in the two cases. To this end, we 
chose an identical pulse g(t) in both cases. We selected the 
signal segment [0, Tn], where Tq = 1, and chose the signal 
parameters so as to guarantee an identical To-local rate of 
innovation. We also used identical sampling kernels in both 
settings: specifically, we chose the kernels (97) which measure 
the N lowest frequency components of the signal. 

To simplify the analysis and focus on the fundamental 
differences between these settings, we will assume in this 
section that the pulses g(t) are compactly supported, and that 
the time delays are chosen such that pulses from one period do 
not overlap with other periods. In other words, if the support 
of g(t) is given by [t a ,tb\, then we require 



min{^} > t a and max{i^} < T 



(105) 



Specifically, we chose the pulse g(t) used in Fig. 1(b), which 
is compactly supported to a high approximation. 

For the periodic signal, we chose L = 10 pulses with 
random delays and amplitudes, picked so as to satisfy the 
condition (105). A period of T = 1 was selected. This implies 
that the signal of interest is determined by 2L = 20 parameters 
(L amplitudes and L time delays). 

To construct a semi-periodic signal with the same number 
of parameters, we chose a period of T = 1/9 containing 
L = 2 pulses. The segment [0, To] then contains precisely 
M = 9 periods, for a total of 20 parameters. While it may 
seem plausible to require the same number of periods for 



19 



both signals, this would actually disadvantage the periodic 
approach, as it would require the estimation of much more 
closely-spaced pulses. 

The CRB for the periodic signal was computed as explained 
in Section IX-B, and the CRB for the semi-periodic signal can 
be calculated in a similar fashion. The results are compared 
with the continuous-time CRB in Fig. 4. Note that since the 
number of parameters to be estimated is identical in both signal 
models, the continuous-time CRB for the two settings coin- 
cides. Consequently, for a large number of measurements, the 
sampled bounds also converge to the same values. However, 
when the number of samples is closer to the rate of innovation, 
the bound on the reconstruction error for the semi-periodic 
signal is much lower than that of the periodic signal. As 
mentioned above, this is in agreement with previously reported 
findings for the two types of signals [2], [5], [6]. 

To find an explanation for this difference, it is helpful to 
recall that both signals can be described using the union of 
subspaces viewpoint (see Section III-C). Each of the signals 
in this experiment is defined by precisely 20 parameters, which 
determine the subspace to which the signal belongs and the 
position within this subspace. Specifically, the values of the 
time delays select the subspace, and the pulse amplitudes 
define a point within this subspace. Thus, in the above set- 
ting, the periodic signal contains 10 parameters for selecting 
the subspace and 10 additional parameters determining the 
position within it; whereas for the semi-periodic signal, only 
2 parameters determine the subspace while the remaining 
18 parameters set the location in the subspace. Evidently, 
identification of the subspace is challenging, especially in 
the presence of noise, but once the subspace is determined, 
the remaining parameters can be estimated using a simple 
linear operation (a projection onto the chosen subspace). 
Consequently, if many of the unknown parameters identify the 
position within a subspace, estimation can be performed more 
accurately. This may provide an explanation for the difference 
between the two examined signal models. 

As further evidence in support of this explanation, we recall 
from Section III-C that the multiband signal model (19) can 
also be viewed as a union of subspaces. Here, again, the 
parameters {toe}f =1 determining the subspace (i.e., the utilized 
frequency bands) are far fewer than the parameters {(^[n]} 
selecting the point within the subspace (i.e., the content of 
each frequency band). In support of the proposed explanation, 
highly noise resistant algorithms can be constructed for the 
recovery of multiband signals [24], [26]. An even more 
extreme case is the single subspace setting, exemplified by 
shift-invariant signals (Section III-A). In this case, all of the 
signal parameters are used to determine the position within the 
one possible subspace. As we have seen in Section VII-C, in 
this case Nyquist-equivalent sampling at the rate of innovation 
achieves the continuous-time CRB. 

X. Conclusion 

In this paper, we studied the inherent limitations in recov- 
ering FRI signals from noisy measurements. We derived a 
continuous-time CRB which provides a lower bound on the 



achievable MSE of any unbiased estimation method, regardless 
of the sampling mechanism. We showed that the rate of 
innovation pr is a lower bound on the ratio between the 
average MSE achievable by any unbiased estimator and the 
noise variance o\, regardless of the sampling method. This 
stands in contrast to the noise-free interpretation of px as the 
minimum sampling rate required for perfect recovery. 

We next examined the CRB for estimating an FRI signal 
from a discrete set of noisy samples. We showed that the 
sampled bound is in general higher than the continuous-time 
CRB, and approaches it as the sampling rate increases. In 
general, the rate which is needed in order to achieve the 
continuous-time CRB is equal to the rate associated with 
the smallest subspace that encompasses all possible signal 
realizations. In particular, if a signal belongs to a union of 
subspaces, then the rate required to achieve the continuous- 
time bound is that associated with the sum of the subspaces. 
In some cases, this rate is finite, but in other cases the sum 
covers the entire space L2 and no finite-rate technique achieves 
the CRB. 

A consequence of these results is that oversampling can 
generally improve estimation performance. Indeed, our exper- 
iments demonstrate that sampling rates much higher than px 
are required in certain settings in order to approach the optimal 
performance. Furthermore, these gains can be substantial: In 
some cases, oversampling can improve the MSE by several 
orders of magnitude. We showed that the CRB can be used to 
determine which estimation problems require substantial over- 
sampling to achieve stable performance. As a rule of thumb, 
it appears that for union of subspace signals, performance is 
improved at low rates if most of the parameters identify the 
position within the subspace, rather than the subspace itself. 
Our analysis can also be used to identify cases in which no 
existing algorithm comes close to the CRB, implying that 
better approaches can be constructed. In particular, it seems 
that existing algorithms do not deal well with measurement 
sets having a wide dynamic range. 

Lastly, we addressed the problem of choosing the sampling 
kernels. This was done by adopting a Bayesian framework, so 
that an optimality criterion can be rigorously defined. Using a 
generalization of the KLT, we showed that the optimal kernels 
are the eigenfunctions of the autocorrelation function of the 
signal. In the context of time-delay estimation, these kernels 
are exponentials with appropriately chosen frequencies. This 
choice coincides with recent FRI techniques [5]. 

Appendix A 
Proof of Theorem 1 

The following notation will be used within this appendix. 
Let Hi and H2 be two measurable Hilbert spaces, and 
let (fi, J?, P) be a probability space. Consider two random 
variables u : Q, — > Hi and v : Q. — > Ti.2- Then, the notation 
E{uu*} will be used to denote the linear operator H2 — > Hi 
such that, for any hi € Hi and h.2 £ H2, 

(hi,E{uv*}ha} Hl =mhi,u) Hl (v,h 2 ) n2 } (106) 

if the expectation exists for all hi and /i2- 
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We begin by stating two general lemmas which will be of 
use in the proof of Theorem 1 . 

Lemma 1. Let Hi and H2 be two Hilbert spaces, and 
consider the operators 



A : Hi 
B:H 2 
C:H 2 



Hi, 
Hi, 
Hi. 



(107) 



Suppose C is self-adjoint and invertible. Define the product 
Hilbert space Hi x H2 in the usual manner, and suppose the 
operator M : Hi x H2 —> Hi x H2 defined by 



M 



Ahi - 
B*hi 



Bh 2 
-Ch 2 



is positive semidefinite (psd). Then, 

A >- BC~ X B* 



(108) 



(109) 



in the sense that the Hi — > Hi operator A — BC X B* is psd. 

Proof: Since M is psd, we have for any hi £ Hi and 

h 2 £ H 2 



«ixH 2 



which implies 

(hi,Ahi) Hi 
Choosing h 2 : 



25R [{h x ,Bh 2 ) U2 ] + (b,Cb) n2 > 0. (Ill) 



—C~ 1 B*hi, we have that (hi, Bh,2) Uo = 
- (B*hi, C~ 1 B*hi) n , which is real since C _1 is self- 
adjoint. It follows from (111) that 

(h u Ahi) Hi - (h^BC-'B^i)^ > (112) 

which leads to (109), as required. ■ 

Lemma 2. Let Hi and H2 be two Hilbert spaces and let 
(VL, & , P) be a probability space. Let u : D, — > Hi and v : 
— > H2 be random variables, and suppose the expectations 
E{ww*}, E{uw*}, and E{w*} exist as linear operators as 
defined in (106). IflE{vv*} is invertible, then 



E{uu*} y E{uv*} (E{w*}) _1 E{vu*} 



(113) 



Proof: Let us denote A = E{uu*}, B = E{uv*}, and 
C = E{vv*} and define the linear operator M : Hi x H2 —> 
Hi x H 2 as in (108). From (106), for any h x £ Hi and 
h,2 £ H2 we have 



M 



hi 



h 2/ 

= (hi,Ahi) Hi + 23? [(hi,Bh 2 ) ni ] + (b, Cb) H2 
= E{ \(hi,u) Hi \ 2 + 23? [(hi,u) Hi (v, h 2 ) H2 ] 

+ \(h2,v) n2 \ 2 } 

= E^\(hi,u) ni + (w,/i 2 ) W2 | 2 | 

>0. (114) 

Thus M is a psd operator. Invoking Lemma 1 yields (113), as 
required. ■ 



We are now ready to prove Theorem 1. 

Proof of Theorem 1: Throughout the proof, let 9 be 
a fixed parameter and consider all functions as implicitly 
dependent on 6. Define the random variables 



u : D, 
v : fl 



H : 

TOK 



u(u>) = x(y(uj)) - h e , 



(115) 



uH = siog^M) (U6) 



We then have the linear operators E{w*} : ~M. K — > M. K , 
E{uu*} :H^H, and E{uv*} : M. K — > H, which satisfy 



E{vv*}=3 e , 

((Pi,E{uu*} ipj) =E{((fi,u) (u,ipj)}, 

d log p e {y) 



((fii,E{uv*} ej) =E<^ (cpi,u) 



86., 



(117) 
(118) 

(119) 



where {(p n }n£Z denotes a complete orthonormal basis for H. 
The operator E{uu*} can be thought of as the co variance of 
x, and is well-defined since, by (32), x has finite variance. 
Indeed, we have 

53( Vij E{uu*}vJi> =E{||x-/i e ||| 2 } <oo (120) 

so that E{uu*} is not only well-defined, but a trace class 
operator. Furthermore, E{i>v*} = Jg is well-defined and 
invertible by Assumption P5. The operator E{ww*} is thus 
also well-defined by virtue of the Cauchy-Schwarz inequality. 
To prove the theorem, we will show that 



} - W 



(121) 



and then obtain the required result by applying Lemma 2. To 
demonstrate (121), observe that 

d l og pejy) 
89, 



(^i,E{m;*}e,) =E (ip t ,u) 



(ipi,x(y) - he) 



1 dp(y;9) 
p(y; 9) 89, 



p(y;9)Pe (dy) 



= I (<Pi,x(y) - he) Um 



P (y;9 + A ej )-p(y;9) 
A 



Pe a (dy). 
(122) 



By Assumption P4, for any sufficiently small A > we have 

.p(y,e + AB j )-p(y;e) 



((Pi,x(y) - 
< \(ipi,x(y) - h g )\q(y,9). 



(123) 



Let us demonstrate that the right-hand side of (123) is abso- 
lutely integrable. By the Cauchy-Schwarz inequality, 



I ('fit, Kv) - h e) I q(y, 0)Pe a (dy) 



< 



\((fi,x(y) - h e )\ Pe (dy) ■ / q 2 (y,9)Pe (dy) 



(124) 
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The rightmost integral in (124) is finite by virtue of (29). As APPENDIX B 

for the remaining integral, we have MAXIMIZATION OF (77) 

/The task of maximizing (77) is most easily accomplished 

\(<Pi,x(y)-h e )\ Pe (dy) by optimizing the coordinates of s n (t) in the orthonormal 

/basis for L2 [0, T] generated by the eigenfunctions of Rx (t, 77). 

\\x(y) — he\\ 2 Pe {dy) Specifically, the function s n (t) can be written as 

(a) f 00 

< (\\i(y)\\ + \\he\\) 2 Pe (dy) s n (t) = E < (Afc + <>l) ~ h 4>k (t), (132) 

k— 1 

(b) r 2 r 

< J \\Hv)\\ p e a {dy) + \\h g \\ J Pe a {dy) with {i)k{t)}f =l and {\k)T=i ° f ( 68 )- ( The coefficients 

. \ 1/2 (Afc + u 2 )~ 1/ ' 2 are inserted since they simplify the subsequent 

+ 2||/i e || ( / \\x(y)\\ 2 Pg (dy) ] analysis.) Now, by Mercer's theorem, R x {t,r}) can be ex- 

\J / pressed as 

00 

(125) i?x(t,^)=^A^W^W, (133) 



(c) 

< 00 



«=1 



where we have used the triangle inequality in (a), the Cauchy- 
Schwarz inequality in (b), and the assumption (32) that £ has where the convergence is absolute and uniform. Therefore 
finite energy in (c). We conclude that (123) is bounded by pT 
an absolutely integrable function, and we can thus apply the / Rx(t,T)s n (T)dT 
dominated convergence theorem to (122), obtaining 



8 



/ E «* (A* + olY k Mr) E ^MtW(r)dT 
Jo 



(ipi,E{uv*} Bj) = — J (<Pi,x(y))p(y;d)Po (dy) J ° k=i 



d f 

(viM-Qf ] p(y;O)p 0o {d y ). (126) 



E a * ^—rMt), (134) 



fc=l 



T , , . . , . , , j j • • and consequently, by Parseval's theorem, (77) is given by 

The second integral in (126) equals 1 and its derivative is M •" J v ' & 3 

therefore 0. Thus we have n ™ ^ 2 



}ej) = — Q = — ^ . (127) 



T 

Rx(t,T)s n (T)dr 



d9 j de 3 N 



dt 



E 



00 . 

fe=i 



(A fc + <7 2 )* 



On the other hand, note that the Frechet derivative dhg /dO 
of (31) coincides with the Gateaux derivative of tig. In other 
words, for any vector v g we have _ 1 n 1 2 ^fc Q35") 



n=l fc=l 



9he v = lim fee+ e v - h e 
QQ e->o e Similarly, using (132) and (134), we have 

It follows that f f T 



dh g \ _ d(tfi,h 6 
86 e V ~ W) 



/ / C(t)Jix(t,r)« n (T)drdt = 5>£(aj?) 

•'-'0 



(pi '^) = —hT— (136) 

and by (132) 



Since E{uu*} and (dhg/d8)* are both linear operators, (127) 
and (129) imply that the two operators are equal, demonstrat- „ f T J^L a 2 

ing (121). Applying Lemma 2 and using the results (117) and CT c y s mW s »W« = 2^ a fc ) A^+~of ' 



(121), we have 



E^-M^-M*}^^)^- 1 ^) • (130) translated to 



fc=i 

Combining (136) and (137), the set of constraints (69) is 



E«rKr = <w d38) 



As we have seen, the left-hand side of (130) is trace class, and 
thus so is the right-hand side. Taking the trace of both sides 
of the equation, we obtain for ever Y TO > 71 = h---,N. Consequently, our problem has 

now been reduced to 

««*-».■■>>»((£)*(£)') <»» ££ Kf _iL tt £>-«,. _ w 

which is equivalent to (33), as required. ■ (139) 
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We now show that the sequences which solve (139) 

must satisfy ajj = for every k > N and n = 1, . . . , N. To 
see this, assume to the contrary that the nth sequence satisfies 
a™ ^ for some £ > N. We can then replace this sequence 
by a sequence {a^kez satisfying 



\al\ 2 +al 



1 < k < N 

k = e 

N < k and k ^= £ 



A-e: 



(140) 



1). 



where X)fc=i a A = l a ™| 2 ( to ensure that ^ 
Such a set of coefficients {ak}^ =1 can always be found since 
the TV-term truncation of the remaining N — 1 sequences 



cannot span 



With this sequence, the nth summand in 



the objective of (139) becomes 

oo 



*!=i 



A? 



A-l 




(141) 



k=\ 



where we used the fact that > Xi for every k < £ and 
that z 2 /(a + z) is a monotone increasing function of z for all 
z > 0. This contradicts the optimality of {ajj}fcgz. Therefore, 
the set of sequences maximizing (139) satisfy a£ = for 
every k > N and n = 1, . . . , N. 

It remains to determine the optimal values of the first N 
elements of each of the N sequences {a^}kez> n = 1, . . . , N. 
For this purpose, let A denote the NxN matrix whose entries 
are Ak, n = a k ar, d l et A be a diagonal matrix with Ak,k = 
Afe/(Afc + ffj). Then, the constraint (138) can be written as 
A*A = Ijv, which is equivalent to AA* = l N . Now, the 
objective in (139) can be expressed as 



N 



EE 

n=l fc=l 



A?- 



= Tr{A*AA} 

= Tr{AA*A} 
= Tr{A}, 



(142) 



which is independent of A. Therefore, we conclude that any 
set of orthonormal sequences {d^}k<^i, n = 1, . . . , N, whose 
elements vanish for every k > N is optimal. 
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